Accelerating Recommendation System Inference Performance with TensorRT

NVIDIA TensorRT is a high-performance deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications. You can import trained models from every major deep learning framework into TensorRT and easily create highly efficient inference engines that can be incorporated into larger applications and services.

This video demonstrates the steps for using NVIDIA TensorRT to optimize a multilayer perceptron (MLP)-based recommender system trained on the MovieLens dataset.

Five key takeaways from this video:

  1. Importing a trained TensorFlow model into TensorRT is straightforward with the Universal Framework Format (UFF) toolkit included in TensorRT.
  2. You can add extra layers to the trained model even after importing it into TensorRT.
  3. You can serialize the engine to a memory block, which you can then write to a file or stream. This eliminates the need to perform the optimization step again when the engine is reloaded.
  4. Although the model is trained at higher precision (FP32), TensorRT gives you the flexibility to run inference at lower precision (FP16).
  5. TensorRT 4 includes new operations such as Concat, Constant, and TopK, plus optimizations for multilayer perceptrons, to speed up inference performance of recommendation systems.
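The workflow in points 1-4 can be sketched in Python. This is a rough outline, not the code from the sample: it assumes a TensorRT 5/6-era Python API (the UFF parser and `builder.fp16_mode` were later deprecated and removed), and the tensor names (`user_input`, `item_input`, `prediction/Sigmoid`) and file paths are illustrative placeholders, not the actual names used in sampleMovieLens.

```python
# Hedged sketch of the video's workflow: parse a UFF model, optionally
# enable FP16, build an engine, and serialize it so the optimization step
# runs only once. Assumes a TensorRT 5/6-era Python API; exact names vary
# across TensorRT versions.
try:
    import tensorrt as trt
    HAVE_TRT = True
except ImportError:
    # Lets the sketch be imported and read on machines without TensorRT.
    HAVE_TRT = False


def build_and_serialize(uff_path, plan_path, fp16=False):
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network()
    parser = trt.UffParser()

    # Point 1: register I/O tensors by name, then parse the exported
    # UFF file into a TensorRT network. These names are hypothetical.
    parser.register_input("user_input", (1,))
    parser.register_input("item_input", (1,))
    parser.register_output("prediction/Sigmoid")
    parser.parse(uff_path, network)

    # Point 2: extra layers (e.g. a TopK over the MLP scores) can be
    # appended to the network here, after the import.

    if fp16:
        # Point 4: train in FP32, infer in FP16.
        builder.fp16_mode = True

    engine = builder.build_cuda_engine(network)

    # Point 3: serialize the optimized engine to a plan file so later
    # runs can skip the optimization step entirely.
    with open(plan_path, "wb") as f:
        f.write(engine.serialize())


def load_engine(plan_path):
    # Deserialize a previously built plan; no re-optimization needed.
    logger = trt.Logger(trt.Logger.WARNING)
    with open(plan_path, "rb") as f:
        return trt.Runtime(logger).deserialize_cuda_engine(f.read())
```

On current TensorRT releases the same ideas apply, but ONNX has replaced UFF as the recommended import path and FP16 is requested through a builder config flag rather than `fp16_mode`.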

If this intrigues you, you can find more information below:

Find code used in the video at: sampleMovieLens

Find Jupyter Notebook used in the video at: sampleMLP-notebook

Learn more about TensorRT at: https://developer.nvidia.com/tensorrt


