RESTful Inference with the TensorRT Container and NVIDIA GPU Cloud

You’ve built, trained, tweaked and tuned your model. Finally, you have a Caffe, ONNX or TensorFlow model that meets your requirements. But you’re not done: now you need an inference solution, and you need to deploy to a datacenter or to the cloud. You need to get the maximum possible performance. You’ve heard that TensorRT can maximize inference performance on NVIDIA GPUs, but how do you get from your trained model to a TensorRT-based inference engine in your datacenter or in the cloud? The new TensorRT container can help you solve this problem.

TensorRT Container available in the NVIDIA GPU Cloud

Based on NVIDIA Docker, the TensorRT container encapsulates all the libraries, executables and drivers you need to develop a TensorRT-based inference application. In just a few minutes you can go from nothing to having a local development environment for your inference solution that can also act as the basis for your own container-based datacenter or cloud deployment.

In this post I’ll introduce the TensorRT container and describe the simple REST server included in the container, which can act as a basis or inspiration for your own deployment solution. If you are new to TensorRT or are currently using an earlier version, I recommend reading the following blog post on new features and capabilities in TensorRT 3: “TensorRT 3: Faster TensorFlow inference and Volta support”.

Getting the TensorRT Container

Before you can use the TensorRT container, you need to install some software as described in the Installing Docker and NVIDIA Docker section of the NVIDIA Docker blog post.
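
Once Docker and NVIDIA Docker are installed, a quick sanity check is to run nvidia-smi inside a CUDA base image; if this prints your GPU table, the container runtime is working. (This check is not specific to the TensorRT container, and the CUDA image tag available on your system may differ.)

$ nvidia-docker run --rm nvidia/cuda nvidia-smi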

The TensorRT container is available as part of the NVIDIA GPU Cloud. To get the container you first need to create an account at https://ngc.nvidia.com as described in the following video.

Once you’ve created your account, you can use your API key to log in and then pull the container (you should pull the most recent version of the container). You can pull the container to your local system, or use the AWS AMI named “NVIDIA Volta Deep Learning AMI” and pull to an AWS instance, as described in the video.
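
Once you have an API key, the login itself looks like the following; the username is the literal string $oauthtoken and the password is your NGC API key (the placeholder below is not a real key).

$ docker login nvcr.io
Username: $oauthtoken
Password: <your NGC API key>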

$ docker pull nvcr.io/nvidia/tensorrt:17.12

Now that you have the container, there are multiple ways to use it, depending on whether you want to do TensorRT development or use the example REST server. For now, let’s just take a quick look at what is in the container. Running the following command puts you in a bash shell within the container.

$ nvidia-docker run -it --rm nvcr.io/nvidia/tensorrt:17.12

The /workspace/tensorrt directory contains TensorRT samples that you can modify, build and run. For the REST server, example models and scripts for Caffe, ONNX and TensorFlow models are in /workspace/tensorrt_server and the source code is in /opt/gre/tensorrt. I’ll explain all these in more detail below.
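
For a quick orientation, here is roughly what /workspace/tensorrt_server holds. The exact listing varies by container version; this sketch shows only the files discussed in this post.

/workspace# ls tensorrt_server
caffe_mnist        imagenet_labels.txt  inception_v1.pb    mnist.caffemodel
mnist.prototxt     mnist_labels.txt     onnx_inception_v1  resnet_v1_152_frozen.pb
tensorflow_resnet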

TensorRT Development

The container includes all the TensorRT C++ and Python examples as well as the development tools required to build and execute those samples. You can modify the samples or create your own inference implementations from scratch.

To build the C++ samples, you just need to run make.

$ cd /workspace/tensorrt/samples
$ make

The sample binaries are placed in /workspace/tensorrt/bin. For details on how to run each sample see the TensorRT Developer Guide.

$ cd /workspace/tensorrt/bin
$ ./sample_mnist

The TensorRT Python examples are also available and equally easy to execute.

$ cd /workspace/tensorrt/python/examples
$ python mnist_api.py ../data

REST Server

The tensorrt_server application included in the container uses TensorRT to perform inference with a Caffe, ONNX or TensorFlow model. The server currently supports inference only for image-classification-like networks, so the model must conform to these restrictions:

  • The model must have a single image input (jpeg, png, and other formats are supported).
  • The model must have a single output that is used for classification. This output is flattened into a vector.  Each index in that vector corresponds to a class.

You associate a text label with each classification by providing a labels file, where each line gives the label for the corresponding classification index (a small example follows the list below). You can find an example model and labels for each supported format in /workspace/tensorrt_server.

  • The files mnist.prototxt and mnist.caffemodel define a Caffe model for classifying handwritten digits. The file mnist_labels.txt contains the classification labels on 10 lines, one for each class the network is trained to identify.
  • The file inception_v1.pb defines an ONNX model for the Inception V1 network trained using the ImageNet dataset. The file imagenet_labels.txt contains the classification labels.
  • The file resnet_v1_152_frozen.pb defines a TensorFlow model for the Resnet-152 network trained using the ImageNet dataset. The file imagenet_labels.txt contains the classification labels.
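
To make the labels-file format concrete, here is a made-up example for a hypothetical three-class classifier (my_labels.txt and its contents are just for illustration); line 1 corresponds to classification index 0, line 2 to index 1, and so on.

$ cat my_labels.txt
cat
dog
bird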

Each model includes a helper script that launches tensorrt_server with the appropriate arguments (you can get more information about these arguments from the TensorRT Container Release Notes). These scripts are caffe_mnist, onnx_inception_v1 and tensorflow_resnet. I’ll show you one of these scripts in more detail in the next section.

Let’s run the onnx_inception_v1 script within the container and see how it provides a REST endpoint that you can use for inference.

$ nvidia-docker run -it --rm -p 8000:8000 nvcr.io/nvidia/tensorrt:17.12 tensorrt_server/onnx_inception_v1

=====================
== NVIDIA TensorRT ==
=====================

NVIDIA Release 17.12 (build 250066)

NVIDIA TensorRT 3.0.1 (c) 2016-2017, NVIDIA CORPORATION.  All rights reserved.
GPU Rest Engine (c) 2016-2017, NVIDIA CORPORATION.  All rights reserved.
Container image (c) 2017, NVIDIA CORPORATION.  All rights reserved.

https://developer.nvidia.com/tensorrt

================================================
tensorrt_server -t onnx -d float16 -i data_0 -o prob_1 -m /workspace/tensorrt_server/inception_v1.pb -l /workspace/tensorrt_server/imagenet_labels.txt
================================================
I1129 21:37:02.807007       1 main.go:116] Initializing TensorRT classifier
I1129 21:37:02.807215     1 classification.cpp:776] Converting model /workspace/tensorrt_server/inception_v1.pb
I1129 21:37:13.183311     1 classification.cpp:803] Initializing TensorRT engine on device 0
I1129 21:37:13.196447     1 classification.cpp:191] ... Importing TensorRT engine
I1129 21:37:13.196511     1 classification.cpp:96] Added linear block of size 3211264
I1129 21:37:13.196521     1 classification.cpp:96] Added linear block of size 2323200
I1129 21:37:13.196527     1 classification.cpp:96] Added linear block of size 746496
I1129 21:37:13.196532     1 classification.cpp:96] Added linear block of size 346112
I1129 21:37:14.457584       1 main.go:126] Adding AJAX form at /
I1129 21:37:14.457634       1 main.go:128] Adding REST endpoint /api/classify
I1129 21:37:14.457640       1 main.go:130] Starting server listening on :8000

In this output you can see the container starting (the “NVIDIA TensorRT” banner). The onnx_inception_v1 script then prints the tensorrt_server command line and invokes it to start the server. The rest of the output comes from tensorrt_server itself as it imports the ONNX model and initializes TensorRT. The line Starting server listening on :8000 tells us that the REST server is ready and listening on port 8000. Because you started the container with -p 8000:8000, Docker maps port 8000 within the container to port 8000 on the host, so you can access the REST server from localhost (or 127.0.0.1) on port 8000. The REST server endpoint is /api/classify.

My dog, Rex.

Now use curl to post a request with the body holding the image that you want classified. The response is a JSON-formatted string listing the top three classifications. Let’s see what it says for my dog, Rex.

$ curl --data-binary @rex.jpg \
    http://127.0.0.1:8000/api/classify
[
{ "confidence" : 0.5003, "label" : "DINGO" },
{ "confidence" : 0.4035, "label" : "KELPIE" },
{ "confidence" : 0.0692, "label" : "BASENJI" }
]

Dingo?! Rex is a German Shepherd / Border Collie mix, but I can see the resemblance.

The REST server also exposes an HTML page at http://127.0.0.1:8000/ that you can access from a browser to interactively select an image and see the resulting classifications. Behind the scenes, this HTML page uses the same /api/classify endpoint that you used in your curl command.

Using Your Own Model

Of course, what you really want the REST server to do is use your own model for inference. Let’s walk through what you need to do to get your trained TensorFlow image-classification model into the REST server. Similar steps apply if you have a Caffe or ONNX model.

First, you must freeze your TensorFlow model to make it suitable for inference; the required steps are described in the TensorRT Developer Guide. Let’s assume you’ve saved the frozen model as /home/dev/mymodels/model_frozen.pb. You must also create a labels file as described above. Save this file as /home/dev/mymodels/model_labels.txt.
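
As a rough example of what that freezing step might look like, the following command uses TensorFlow’s freeze_graph tool. Treat it as a sketch: the exact invocation depends on your TensorFlow version, and model.pbtxt, model.ckpt and the myoutput output-node name are placeholders for your own graph definition, checkpoint and output node.

$ python -m tensorflow.python.tools.freeze_graph \
    --input_graph=/home/dev/mymodels/model.pbtxt \
    --input_checkpoint=/home/dev/mymodels/model.ckpt \
    --output_node_names=myoutput \
    --output_graph=/home/dev/mymodels/model_frozen.pb

The resulting model_frozen.pb and your labels file are the two files you’ll make available to the container next.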

Now that you have your trained model and labels, you need to make those files available within the container. Do this by using the docker --mount flag to map /home/dev/mymodels on your host system to /tmp/mymodels within the container.

$ nvidia-docker run -p 8000:8000 -it --rm --mount type=bind,source=/home/dev/mymodels,target=/tmp/mymodels nvcr.io/nvidia/tensorrt:17.12
/workspace# ls /tmp/mymodels
model_frozen.pb
model_labels.txt

Next, you need to run the tensorrt_server executable with the appropriate options. The easiest way to do this is to copy the example from /workspace/tensorrt_server/tensorflow_resnet and modify it, specifying the names of the input and output nodes in your model as well as the format of the input data (the listing below shows the copied script after those edits). The required arguments for a Caffe or ONNX model are slightly different, so be sure to start with the corresponding example script (caffe_mnist or onnx_inception_v1) if your model is in one of those formats.

/workspace/tensorrt_server# cp tensorflow_resnet tensorflow_mymodel
/workspace/tensorrt_server# cat tensorflow_mymodel
#!/bin/bash

SERVER_EXEC=tensorrt_server
MODEL=/tmp/mymodels/model_frozen.pb
LABELS=/tmp/mymodels/model_labels.txt
INPUT_NAME=myinput
INPUT_FORMAT=float32,3,224,224
OUTPUT_NAME=myoutput
INFER_DTYPE=float16
CMD="$SERVER_EXEC -t tensorflow -d $INFER_DTYPE -i $INPUT_NAME -f $INPUT_FORMAT -o $OUTPUT_NAME -m $MODEL -l $LABELS"

$CMD
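
If you aren’t sure of the exact node names in your frozen graph, one quick way to list them is to parse the GraphDef and print each node’s name and op. This is just a sketch; it assumes TensorFlow is installed wherever you run it and uses the /tmp/mymodels path from above.

$ python -c "import tensorflow as tf; gd = tf.GraphDef(); \
    gd.ParseFromString(open('/tmp/mymodels/model_frozen.pb', 'rb').read()); \
    print('\n'.join(n.name + ' (' + n.op + ')' for n in gd.node))"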

Make sure you set INPUT_NAME, INPUT_FORMAT and OUTPUT_NAME appropriately for your model. Now you’re ready to run the REST server using your model. Simply execute the script you just created to start the server.

/workspace/tensorrt_server# bash ./tensorflow_mymodel

As before, use curl to post images to the server; now the server uses your model to perform the classification and reports the results using your labels.

$ curl --data-binary @<image file> http://127.0.0.1:8000/api/classify

Modifying the REST Server

You can examine and modify the source of the REST server in /opt/gre/tensorrt. The server is based on the NVIDIA GPU REST Engine (GRE): the HTTP front end is provided by Go code in main.go, and the bulk of the preprocessing and inference logic is implemented in C++ in classification.cpp.

You can rebuild the server by issuing the following command within the container.

$ go get -ldflags="-s" tensorrt_server

Try the TensorRT Container Today

The TensorRT container provides an easy-to-use environment for you to develop a TensorRT-based inference solution for your datacenter or cloud. Get started by heading over to the NVIDIA GPU Cloud. Learn more about the TensorRT container in the Release Notes and more about TensorRT in the Developer Guide. If you prefer to install and run TensorRT natively, deb and tar install packages are also available. Visit the TensorRT product page to learn more and download.

For more information on how TensorRT and NVIDIA GPUs deliver high-performance, efficient inference, resulting in dramatic cost savings in the data center and power savings at the edge, refer to the following technical whitepaper: GPU Inference Performance Study Whitepaper.

Use the comments below to ask questions, and be sure to let us know about the interesting ways you use the TensorRT container for your inference solution.
