
TensorRT

Auto-generated from NVIDIA/TensorRT by Mutable.ai Auto Wiki

TensorRT (GitHub Repository)

  • Developer: NVIDIA
  • Written in: C++
  • Stars: 8.4k
  • Watchers: 141
  • Created: 2019-05-02
  • Last updated: 2024-01-06
  • License: Apache License 2.0
  • Homepage: developer.nvidia.com/tensorrt
  • Repository: NVIDIA/TensorRT

Auto Wiki

  • Generated at: 2024-01-07
  • Generated from: commit a1820e
  • Version: 0.0.4

TensorRT is an SDK for high-performance deep learning inference optimization and runtime for NVIDIA GPUs. It allows developers to optimize neural network models for fast, low-latency execution on NVIDIA hardware.

The key functionality provided by TensorRT includes:

  • The TensorRT Core Functionality implements the core C++ API for building and running optimized models on NVIDIA GPUs. This includes model loading, graph optimization, execution, and plugins. Tools for debugging and analysis are also provided.

  • TensorRT Plugins allow developers to integrate custom layers, operation plugins, and optimizations into TensorRT. This provides flexibility to optimize a wide range of models. Plugins implement functionality like ROI pooling, sparse tensor operations, normalization layers, etc.

  • TensorRT Python Support exposes the C++ API to Python, enabling pip installation and Python imports. Python Bindings use PyBind11 to bind the C++ API. Python Packaging handles building wheels and Loading TensorRT Libraries.

  • TensorRT Code Samples demonstrate C++ and Python workflows for tasks like model loading, deployment, calibration and leveraging different model types with TensorRT.

  • TensorRT Model Parsers allow importing models from frameworks like Caffe, ONNX, and UFF into TensorRT engines for execution.

  • TensorRT Docker Support provides Dockerfiles and scripts for setting up development environments.

  • TensorRT Demo Applications showcase end-to-end examples for models like ResNet, BERT, Tacotron-2, optimized with TensorRT.

The key design choice is providing a high-performance C++ API for inference on NVIDIA GPUs, while also enabling Python usage and integration of custom plugins and layers. The tools, samples and parsers simplify workflows.

TensorRT Core Functionality

References: TensorRT, tools

The core TensorRT C++ API provides the main functionality for building and optimizing deep learning models to run efficiently on NVIDIA GPUs. The API is defined in header files under the include directory. This includes interfaces for key components like the network definition and execution.

The main interfaces defined are:

  • INetworkDefinition represents the network being built and allows adding operators and tensors through its methods. This is the primary interface for constructing networks programmatically.

  • IBuilder handles building engines from networks. Its build methods compile the network into an optimized engine.

  • ICudaEngine represents the optimized engine artifact that can be saved, loaded, and run on the GPU.

  • IExecutionContext manages launching and running the engine. Its enqueue methods execute inference asynchronously on a CUDA stream.

Plugins extend TensorRT through the IPluginV2 family of interfaces.

Network building involves:

  1. Creating a network definition
  2. Adding operators like convolution and pooling
  3. Configuring the build (precision, workspace, optimization profiles) through a builder configuration
  4. Building an engine from the network

Engines can then be run by:

  1. Creating an execution context from the engine
  2. Allocating buffers and setting bindings
  3. Enqueuing execution asynchronously on a CUDA stream (see the sketch below)
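
A minimal sketch of this build-then-run flow using the TensorRT Python API (which mirrors the C++ interfaces above), with NumPy and PyCUDA standing in for buffer management; the layer choices, shapes, and weights are illustrative only:

    import numpy as np
    import pycuda.autoinit                      # creates a CUDA context
    import pycuda.driver as cuda
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)

    # 1-3. Create a network definition, add operators, and configure the build.
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    inp = network.add_input("input", trt.float32, (1, 3, 224, 224))
    kernel = np.ones((8, 3, 3, 3), dtype=np.float32)
    conv = network.add_convolution_nd(inp, 8, (3, 3), kernel, trt.Weights())
    pool = network.add_pooling_nd(conv.get_output(0), trt.PoolingType.MAX, (2, 2))
    pool.stride_nd = (2, 2)
    network.mark_output(pool.get_output(0))
    config = builder.create_builder_config()

    # 4. Build an engine from the network.
    plan = builder.build_serialized_network(network, config)
    engine = trt.Runtime(logger).deserialize_cuda_engine(plan)

    # Run: create an execution context, bind device buffers, enqueue asynchronously.
    context = engine.create_execution_context()
    h_in = np.random.rand(1, 3, 224, 224).astype(np.float32)
    h_out = np.empty(tuple(engine.get_binding_shape(1)), dtype=np.float32)
    d_in, d_out = cuda.mem_alloc(h_in.nbytes), cuda.mem_alloc(h_out.nbytes)
    stream = cuda.Stream()
    cuda.memcpy_htod_async(d_in, h_in, stream)
    context.execute_async_v2([int(d_in), int(d_out)], stream.handle)
    cuda.memcpy_dtoh_async(h_out, d_out, stream)
    stream.synchronize()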

The tools directory contains utilities like Polygraphy for tasks like model conversion and debugging issues. The scripts folder has utilities for copyright management and stub library generation.

TensorRT C++ API

References: include

The TensorRT C++ API is defined in header files under the include directory. The key files that define the API include …/NvInferImpl.h and …/NvInferLegacyDims.h.

…/NvInferLegacyDims.h defines fixed-rank dimension classes such as Dims2, Dims3, and Dims4 for representing tensor shapes of different ranks. These classes provide a type-safe way to work with tensors whose rank is known at compile time.

TensorRT Tools

References: tools

The …/trt-engine-explorer directory provides a collection of tools for analyzing TensorRT engine artifacts and performance. It contains utilities for tasks like model analysis, conversion, and debugging.

The core functionality is contained within importable Python modules. The …/parser.py module contains the parsing functionality for reading engine metadata files and cleaning the raw data. The …/df_preprocessing.py module preprocesses the raw layer data into a DataFrame for downstream use.

The toolkit provides various "views" of the engine data through reusable modules. The …/graphing.py module generates visualizations of the engine topology as a graph. The …/plotting.py module contains functions for creating different types of plots from the layer properties in the DataFrame. Additional reporting functionality is provided in …/excel_summary.py. Linting of layers is supported by classes in …/lint.py.

Utilities for tasks like building engines, parsing logs, and configuring GPUs are located in …/utils. Comprehensive tests are in …/tests. Documentation and resources are located in files like …/RESOURCES.md and …/README.md.

TensorRT Scripts

References: scripts

This section covers the script utilities provided under scripts. These scripts handle tasks related to managing copyright headers and generating libraries.

The main scripts are …/copyright-scan.py and …/stubify.sh.

…/copyright-scan.py allows recursively scanning files in a directory and its subdirectories to check and update copyright headers in files. It defines patterns to match different file types and standardized headers. Command line arguments provide configuration options for the root directory and dry run mode. Common directories are explicitly excluded from scanning.
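
The sketch below illustrates the kind of recursive scan such a script performs; the header pattern, scanned extensions, and excluded directories are illustrative placeholders, not the script's actual configuration:

    import argparse
    import re
    from pathlib import Path

    HEADER_RE = re.compile(r"Copyright \(c\) \d{4}.*NVIDIA", re.IGNORECASE)
    SCAN_EXTS = {".py", ".cpp", ".h", ".cu", ".sh"}       # illustrative
    EXCLUDE_DIRS = {".git", "build", "third_party"}       # illustrative

    def scan(root: Path, dry_run: bool) -> None:
        for path in root.rglob("*"):
            if any(part in EXCLUDE_DIRS for part in path.parts):
                continue
            if not (path.is_file() and path.suffix in SCAN_EXTS):
                continue
            head = path.read_text(errors="ignore")[:1024]
            if not HEADER_RE.search(head):
                action = "would update" if dry_run else "missing header in"
                print(f"{action}: {path}")

    if __name__ == "__main__":
        ap = argparse.ArgumentParser()
        ap.add_argument("--root", type=Path, default=Path("."))
        ap.add_argument("--dry-run", action="store_true")
        args = ap.parse_args()
        scan(args.root, args.dry_run)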

…/stubify.sh generates "stub" shared libraries that can be used for testing without dependencies on the real implementation. It takes the path to an input library as the first argument and the output stub library path as the second argument. The script uses nm to extract all strong symbols from the input library. It then prints empty function definitions for each symbol, which are compiled into the output stub library using the CC compiler. The soname of the output stub library is set to match the input library.

TensorRT Python Support

References: python

The python directory contains code that provides Python bindings and packaging to access the TensorRT C++ API from Python. This allows tasks like model parsing, optimization, and inference to be performed directly from Python code.

The main functionality is organized into several key subdirectories:

  • …/src implements the core Python bindings that expose the full TensorRT C++ API to Python. This includes bindings for important parts like parsers, inference, and utilities.

  • …/docstrings contains docstrings that define the documentation for the TensorRT Python API. This includes docstrings for core types, classes, functions for tasks like calibration and plugins.

  • …/include defines the Python bindings for TensorRT by containing header files that allow TensorRT C++ APIs and classes to be accessed from Python.

  • …/packaging contains all code necessary to package the TensorRT Python API into distributable Python packages. This allows installation via pip and importing in Python.

The main business logic implemented is:

  • Defining the Python bindings in …/__init__.py

  • Automatically loading any native TensorRT libraries packaged with a wheel via …/__init__.py

  • Implementing common conversions for interfacing frameworks in …/utils.cpp

  • Packaging the module, docs, and libraries into a distributable wheel format

This allows the full TensorRT functionality to be accessed directly from Python via its standard mechanisms like imports and pip. The bindings, packaging, and documentation work together to provide a seamless Python interface.
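
For example, once the wheel is installed with pip, the bindings can be used like any other Python module (a minimal sanity check, not tied to any particular sample):

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.INFO)
    builder = trt.Builder(logger)                      # wraps the C++ IBuilder
    print("TensorRT version:", trt.__version__)
    print("Platform has fast FP16:", builder.platform_has_fast_fp16)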

TensorRT Quickstart Guides

References: quickstart

The quickstart directory contains introductory examples and tutorials for using TensorRT to optimize deep learning models for inference. It includes important subdirectories that provide sample code demonstrating common workflows:

  • The …/IntroNotebooks directory contains Jupyter notebooks and Python files that load models and make predictions with optimized models.

  • The …/SemanticSegmentation directory shows examples for performing semantic segmentation with TensorRT.

  • The …/common directory contains common utilities like functions defined in …/logger.h that are reused across samples.

  • The …/deploy_to_triton directory demonstrates deploying a model optimized with TensorRT to the Triton inference server. This includes code in …/triton_client.py for using a Triton client to run inference on the server.

The key functionality demonstrated in these examples includes model loading, conversion between formats like ONNX, building inference applications, optimization workflows, and deploying models to production services. The README and code samples provide tutorials that introduce TensorRT concepts through concrete examples and exercises.

TensorRT Docker Support

References: docker

The Dockerfiles and scripts in docker provide a standardized way to set up TensorRT development environments using Docker containers. The Dockerfiles define images for common operating systems like Ubuntu and CentOS that install TensorRT, its dependencies, and other development tools. Specific Dockerfiles target different OS versions and architectures.

The …/build.sh script builds images from the Dockerfiles. It parses its command line arguments, constructs the corresponding docker build command string, and executes it.

The …/launch.sh script launches containers from the built images, mounting the local source directory and configuring runtime options such as GPU visibility. It constructs the docker run command string from its arguments and executes it.

Key Dockerfiles target Ubuntu 18.04, Ubuntu 20.04, CentOS 7, and aarch64 cross-compilation; they are described in the TensorRT Docker Support section below.

TensorRT Plugins

References: plugin

The plugin directory contains implementations of custom layers and optimizations that can be integrated into TensorRT as plugins. Plugins allow techniques such as non-maximum suppression, fused multi-head attention, custom normalization layers, and sparse tensor operations to be optimized and run as part of the TensorRT inference pipeline.

Plugins are implemented as C++ classes that inherit from interfaces defined in files like …/plugin.h to integrate with TensorRT. Creator classes defined in files like …/batchedNMSPlugin.cpp handle plugin instantiation and lifecycle functions.
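
As an illustration, a registered plugin can be located from Python through the global plugin registry; the plugin name "BatchedNMS_TRT" and version "1" below are assumed registration values for the batched NMS plugin:

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    trt.init_libnvinfer_plugins(logger, "")            # registers the built-in plugins

    registry = trt.get_plugin_registry()
    creator = registry.get_plugin_creator("BatchedNMS_TRT", "1")
    if creator is not None:
        # Each creator advertises the fields its createPlugin() call expects.
        print("plugin fields:", [f.name for f in creator.field_names])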

Common Plugins

References: plugin/common

The core utilities and base classes reused across plugins are contained in the …/common directory. This directory provides important interfaces and functionality through files like …/plugin.h.

…/plugin.h contains macros and utilities for tasks commonly needed by plugins like error handling and serialization.

The …/kernels subdirectory holds optimized CUDA kernel implementations for common deep learning operations through functions in files like …/kernel.h. These kernels handle the data-parallel aspects of algorithms.

Utilities exist for common tasks across plugin types, through files like …/cudaDriverWrapper.h and …/reducedMathPlugin.cpp.

Overall, these common plugin components provide a foundation that plugin developers can leverage to build custom layers for TensorRT more modularly and consistently.

Attention and Transformer Plugins

References: plugin/bertQKVToContextPlugin, plugin/skipLayerNormPlugin

The plugins in this section implement attention and transformer models like BERT. Key plugins handle the multi-head attention computation, which is a core component of transformer architectures.

The …/bertQKVToContextPlugin directory contains plugins for efficiently computing multi-head attention. The …/fused_multihead_attention subdirectory provides fused GPU kernel implementations of the multi-head attention computation, and …/fused_multihead_attention_v2 provides an updated set of fused GPU kernels.

The header …/fused_multihead_attention_v2.h defines the structures that hold the parameters and buffers needed to run multi-head attention.

The …/zeroPadding2d.h file defines a class for padding inputs that handles padding transparently on the GPU to prepare inputs for the attention kernels.

The …/skipLayerNormPlugin implements skip layer normalization commonly used in transformer models like BERT.

Computer Vision Plugins

References: plugin/batchedNMSPlugin, plugin/cropAndResizePlugin, plugin/detectionLayerPlugin

The computer vision plugins implement common computer vision tasks like non-maximum suppression and bounding box processing. The …/batchedNMSPlugin directory contains a plugin that implements non-maximum suppression (NMS) for object detection models. NMS is used to suppress overlapping bounding boxes.

The main classes are defined in …/batchedNMSPlugin.h and implemented in …/batchedNMSPlugin.cpp.

The …/cropAndResizePlugin directory contains a plugin that implements a crop and resize layer. The main classes are defined in …/cropAndResizePlugin.h and implemented in …/cropAndResizePlugin.cpp.

The …/detectionLayerPlugin directory contains a plugin for object detection post-processing. The class is defined in …/detectionLayerPlugin.h and implemented in …/detectionLayerPlugin.cpp

Normalization Plugins

References: plugin/instanceNormalizationPlugin, plugin/groupNormalizationPlugin

The …/instanceNormalizationPlugin and …/groupNormalizationPlugin directories contain plugins that implement instance normalization and group normalization layers respectively. Instance normalization normalizes the input by calculating the mean and variance of each channel across a batch of examples, while group normalization divides channels into groups and normalizes within each group.

The core instance normalization computation is performed by functions defined in …/instanceNormFwd.h. This includes functions for calculating buffer sizes and executing the instance norm kernel on tensors. Similarly, the main computation for group normalization is performed by a function referenced in …/GroupNormalizationPlugin_PluginReference.py. This applies group normalization using NumPy functions.

Both plugins utilize common normalization utilities defined in files like …/instanceNormCommon.h. These include functions for loading data, arithmetic operations, activations, parallel reductions, and type conversions.
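
A NumPy reference for the group normalization computation described above, in the spirit of the plugin reference script (the function signature and defaults here are illustrative):

    import numpy as np

    def group_norm(x, gamma, beta, num_groups, eps=1e-5):
        """x: (N, C, H, W); gamma, beta: (C,)."""
        n, c, h, w = x.shape
        g = x.reshape(n, num_groups, c // num_groups, h, w)
        mean = g.mean(axis=(2, 3, 4), keepdims=True)
        var = g.var(axis=(2, 3, 4), keepdims=True)
        g = (g - mean) / np.sqrt(var + eps)
        out = g.reshape(n, c, h, w)
        return out * gamma.reshape(1, c, 1, 1) + beta.reshape(1, c, 1, 1)

    # Instance normalization is the special case where every channel is its
    # own group, i.e. num_groups == C.
    x = np.random.rand(2, 8, 4, 4).astype(np.float32)
    y = group_norm(x, np.ones(8, np.float32), np.zeros(8, np.float32), num_groups=4)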

Sparse Operation Plugins

References: plugin/scatterPlugin

This section covers plugins implemented in TensorRT for sparse tensor operations. The …/scatterPlugin directory contains an implementation of a scatter plugin. The file …/scatterPlugin.cpp contains the main class which implements the core scatter logic.

The plugin class handles the scatter computation in its enqueue() method, which takes the input and output tensors and launches the scatter operation on the GPU. It also implements plugin lifecycle functions such as attaching to CUDA contexts and serialization. This allows the scatter operation to be optimized and run efficiently as part of the TensorRT engine.

The accompanying creator class is responsible for plugin instantiation. Its createPlugin() method constructs a new plugin instance given metadata such as the plugin name and fields.

The file …/CMakeLists.txt contains CMake configuration needed to build the scatter plugin. It finds the C++ and CUDA source files in the directory and propagates these variables to parent CMake scripts, allowing the source files to be compiled.

In summary, these files implement a TensorRT plugin that allows scattering of tensor data to run efficiently inside inference graphs: the plugin class handles the core scatter computation and lifecycle functions, while the creator class manages plugin instantiation.
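
For reference, a NumPy sketch of a ScatterND-style update, which is the kind of scatter the plugin performs on the GPU (the exact semantics supported by the plugin may differ):

    import numpy as np

    def scatter_nd(data, indices, updates):
        """Copy `data`, then write each update at the location its index row selects."""
        out = data.copy()
        flat_idx = indices.reshape(-1, indices.shape[-1])
        flat_upd = updates.reshape(-1, *updates.shape[indices.ndim - 1:])
        for idx, upd in zip(flat_idx, flat_upd):
            out[tuple(idx)] = upd
        return out

    data = np.zeros((4, 4), dtype=np.float32)
    indices = np.array([[0, 1], [2, 3]])          # each row indexes one element
    updates = np.array([5.0, 7.0], dtype=np.float32)
    print(scatter_nd(data, indices, updates))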

Utility Operation Plugins

References: plugin/batchTilePlugin, plugin/clipPlugin

This section covers plugins that implement utility operations for neural networks like tiling. The …/batchTilePlugin directory contains a plugin that tiles the input tensor across the batch dimension, effectively replicating the input for each batch entry.

The core logic lives in a class defined in …/batchTilePlugin.h. This class takes two input tensors, with the second providing the template shape to tile across the batch axis of the first input. The tiling operation loops through the batch size, copying the second input tensor, and concatenating the copies along the batch dimension to construct the tiled output tensor.

A class in …/batchTilePlugin.h handles plugin instantiation.

No parameters are needed for the tiling since the input shapes already provide the necessary information. The CMake configuration in …/CMakeLists.txt collects the source files for building the plugin.

The …/clipPlugin directory implements a clipping plugin that limits output values to a specified min-max range. The core clipping operation is performed by a function defined in …/clip.h, which takes the input data and clips values in-place.

The plugin class in …/clipPlugin.h applies the clipping by looping through the input tensor values and clamping each one; it is constructed with the min and max clip values.
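
NumPy reference sketches for the two utility operations described above (shapes and values are illustrative):

    import numpy as np

    def batch_tile(batch_input, template):
        """Replicate `template` once per entry in `batch_input`'s batch dimension."""
        n = batch_input.shape[0]
        return np.concatenate([template[np.newaxis, ...]] * n, axis=0)

    def clip(x, clip_min, clip_max):
        """Limit values to the [clip_min, clip_max] range, as the clip plugin does."""
        return np.clip(x, clip_min, clip_max)

    a = np.zeros((4, 3, 8, 8), dtype=np.float32)       # provides the batch size
    b = np.random.rand(3, 8, 8).astype(np.float32)     # template tiled across the batch
    print(batch_tile(a, b).shape)                      # -> (4, 3, 8, 8)
    print(clip(np.array([-2.0, 0.5, 9.0]), 0.0, 1.0))  # -> [0.  0.5 1. ]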

TensorRT Python Support

References: python

The TensorRT Python support code provides a seamless interface to access the high-performance TensorRT C++ API directly from Python. It handles tasks like defining Python bindings for TensorRT types and functions, packaging the module for distribution via pip, and automatically loading native TensorRT libraries.

The core functionality is implemented in directories like …/src. Here, files define the main TensorRT Python module and imports submodules for key parts of the API like inference and parsers. Files like …/utils.cpp provide common conversions between data types.

The bindings are defined using PyBind11 to expose the C++ API to Python: Python classes wrap the corresponding TensorRT C++ classes, and Python functions wrap the C++ free functions.

Packaging is handled in directories like …/packaging. Files in subdirectories like …/bindings_wheel and …/libs_wheel configure building Python wheels for installation via pip.

Documentation is generated from docstrings in …/docstrings. The strings define namespaces and describe classes and functions. This documentation is then exposed from the Python module.

Overall, the Python support code leverages modern Python tools to deliver a fully-featured interface to TensorRT's low-level performance. Developers can integrate deep learning models into applications while retaining high efficiency.

Python Bindings

References: python/src, python/include

The …/src directory implements the Python bindings for the TensorRT C++ API using PyBind11. The main implementation file is …/pyTensorRT.cpp, which defines the core TensorRT Python module and handles importing submodules that bind different parts of the API.

The …/parsers subdirectory binds parsers for popular model formats. For example, …/pyCaffe.cpp defines bindings for the Caffe parser class. This allows loading Caffe models directly from Python.

…/infer contains bindings for core inference functionality.

Utilities in …/utils.cpp handle common conversions between data types when interfacing frameworks.

…/ForwardDeclarations.h provides forward declarations of classes to bind, while …/utils.h contains helper functions for the bindings.

Python Packaging

References: python/packaging

The Python packaging directories handle distributing the TensorRT Python bindings as distributable Python wheels. This allows the bindings to be installed via pip and used from Python code.

The …/bindings_wheel directory contains the code to build a wheel package for the bindings. The key file is …/__init__.py which defines the bindings that expose the TensorRT C++ API to Python.

The …/libs_wheel directory handles packaging the TensorRT native libraries into wheels. The …/__init__.py file automatically loads any library files packaged with the wheel to make them available to Python code without additional load logic.

The …/frontend_sdist directory packages the API for distribution via pip. The main component is the tensorrt module in …/tensorrt which implements the wrapping of the C++ API.

The key business logic is defining the bindings in …/__init__.py and automatically loading libraries via …/__init__.py. This allows distributing TensorRT via standard Python mechanisms while providing access to the low-level C++ capabilities.

The …/setup.py file defines metadata and builds the wheel package for the bindings. It specifies package data files and ensures compatibility.

The …/setup.py configures building a wheel for the native libraries. It determines dependencies and packages the shared library files.

Python Documentation

References: python/docstrings

The …/docstrings directory contains documentation strings (docstrings) that define the Python API for TensorRT. These docstrings are exposed via Python's documentation generation tools to provide documentation to users of the TensorRT Python package.

The file …/pyPluginDoc.h contains docstrings for the plugin-related classes and functions.

The …/pyOnnxDoc.h file documents the ONNX parser bindings.

Calibration is documented in …/pyInt8Doc.h, which contains docstrings describing the INT8 calibrator classes and their methods.

These docstring headers define documentation only and contain no implementation code; they describe the functionality implemented in the binding source files.

Loading TensorRT Libraries

References: python/packaging/libs_wheel/tensorrt_libs

The __init__.py file in the …/tensorrt_libs directory automatically loads any TensorRT native library files that have been packaged with the Python wheel. This makes the native library functionality available for import and use in other modules without needing additional load code.

The file loops through the files in the directory, passing each one to a function to attempt loading it as a dynamic library. Any failures to load are ignored using a try/except block.

By loading all packaged libraries during import of the module, their functionality is exposed and can be directly used by any code that imports without requiring further logic. Loading is done upfront to avoid needing loading code elsewhere.
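
A hedged sketch of this eager loading pattern; the directory handling and filename glob are illustrative rather than the module's exact code:

    import ctypes
    import glob
    import os

    _HERE = os.path.dirname(os.path.abspath(__file__))

    for _lib in sorted(glob.glob(os.path.join(_HERE, "*.so*"))):
        try:
            # RTLD_GLOBAL makes the symbols visible to the bindings loaded later.
            ctypes.CDLL(_lib, mode=ctypes.RTLD_GLOBAL)
        except OSError:
            # Libraries that fail to load (e.g. wrong platform) are skipped.
            pass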

TensorRT Code Samples

References: samples, quickstart

The TensorRT code samples demonstrate end-to-end workflows for tasks like model loading, calibration, deployment, and leveraging different model types. Code is organized into the following areas:

  • The samples directory contains C++ and Python samples that show usage of various TensorRT APIs. This includes algorithm selection, building RNNs, dynamic shapes, INT8 precision, custom plugins, and I/O formats.

  • The quickstart directory provides introductory Jupyter notebooks and examples for common deep learning tasks like semantic segmentation, image classification, and model deployment to Triton.

Some key implementation details:

  • The …/common directory contains reusable utilities like data loading, engine building, inference execution, and performance reporting used across samples.

  • Notebooks in …/IntroNotebooks provide tutorials on common deep learning workflows optimized using TensorRT.

TensorRT C++ Sample Code

References: samples, samples/common, samples/sampleINT8API, samples/sampleCharRNN

The TensorRT C++ sample code provides examples that demonstrate common workflows for building and running inference using the TensorRT C++ API. Key code samples are located in the samples and …/common directories.

The …/common directory contains utilities that are reused across multiple samples. This includes functions for tasks like data loading and running inference. The utilities handle common boilerplate and allow the samples to focus on demonstrating TensorRT capabilities.

Many samples are implemented as C++ classes that encapsulate the end-to-end workflow. For example, classes in …/sampleCharRNN.cpp handle loading models, constructing networks using the API, building engines, and executing inference in a loop.

Samples cover common C++ workflows like model loading and engine building. The utilities and class-based approach promote code reuse and isolate complex logic. The detailed samples provide many examples of integrating TensorRT into full applications.

TensorRT Python Sample Code

References: samples/python, samples/python/efficientnet, samples/python/introductory_parser_samples

The TensorRT samples Python directory contains numerous examples that demonstrate common workflows using the TensorRT Python API for tasks like model loading, preprocessing, inference, evaluation, and deployment. Many of these samples are organized into subdirectories based on the type of model or operation they showcase.

The …/introductory_parser_samples directory contains introductory examples of basic usage. The scripts in this directory show how to build a TensorRT engine from an ONNX model and run inference on the engine to classify images.
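
A minimal sketch of the parse-then-build flow these introductory samples follow, using the TensorRT Python API; the model path and file names are placeholders:

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)

    with open("model.onnx", "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("failed to parse ONNX model")

    config = builder.create_builder_config()
    plan = builder.build_serialized_network(network, config)
    with open("model.engine", "wb") as f:
        f.write(plan)                   # serialized engine, reloadable at runtime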

The …/tensorflow_object_detection_api directory contains examples for object detection models.

The …/efficientnet directory implements an end-to-end pipeline for EfficientNet models.

The …/network_api_pytorch_mnist directory shows converting a trained PyTorch MNIST model to a TensorRT engine.

TensorRT Model Format Samples

References: samples/sampleOnnxMNIST, samples/python/network_api_pytorch_mnist

The samples in the directory …/sampleOnnxMNIST and …/network_api_pytorch_mnist demonstrate parsing models from different frameworks into TensorRT.

The …/sampleOnnxMNIST sample shows how to take an ONNX model of an MNIST classifier, parse it with the ONNX parser into a TensorRT network, build an engine from that network, allocate buffers, and run inference on the engine to classify digit images. The main sample class encapsulates building the engine as well as handling data I/O and buffer management when running inference on the engine context.

The …/network_api_pytorch_mnist sample demonstrates an end-to-end workflow of training a CNN model for MNIST classification using PyTorch, extracting the weights, building a TensorRT engine populated with these weights, and performing inference. Functions are also provided for loading input data and running the full inference workflow.

TensorRT Plugin Samples

References: samples/python/onnx_custom_plugin

The …/onnx_custom_plugin directory implements custom plugins for TensorRT. It contains code to build a plugin as a shared library, load the plugin at runtime, modify an ONNX model to use the custom plugin, build a TensorRT engine incorporating the plugin, and test the plugin implementation.

The subdirectory …/plugin implements the custom operation using CUDA and the TensorRT plugin interfaces: a C++/CUDA implementation of the hardmax operation packaged as a TensorRT plugin.

The file …/CMakeLists.txt configures building the custom plugin as a shared library target, setting compiler flags and linking dependencies.

The file …/load_plugin_lib.py provides a way to dynamically load the custom plugin library at runtime by checking OS-specific paths and names.

The file …/model.py preprocesses an ONNX model by replacing nodes with the custom operation.

The file …/sample.py builds a TensorRT engine from the modified ONNX model, incorporating the custom plugin. It runs inference on samples.

The file …/test_custom_hardmax_plugin.py tests that the custom plugin implementation matches a NumPy reference.
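
A hedged sketch of the runtime loading step; the shared-library path below is an assumed build output, not the sample's actual filename:

    import ctypes
    import tensorrt as trt

    PLUGIN_LIB = "./build/libcustomHardmaxPlugin.so"   # assumed CMake build output

    # Loading with RTLD_GLOBAL lets the library's static initializers register
    # the plugin creator with TensorRT's global registry.
    ctypes.CDLL(PLUGIN_LIB, mode=ctypes.RTLD_GLOBAL)

    logger = trt.Logger(trt.Logger.WARNING)
    trt.init_libnvinfer_plugins(logger, "")
    names = [c.name for c in trt.get_plugin_registry().plugin_creator_list]
    print("registered plugins:", names)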

TensorRT Deployment Samples

References: quickstart/deploy_to_triton

These samples demonstrate deploying TensorRT optimized models to production servers like Triton Inference Server for low latency prediction services. The …/deploy_to_triton directory provides end-to-end examples of deploying a ResNet50 image classification model to Triton.

The sample workflow exports a PyTorch ResNet50 model to ONNX format using …/export_resnet_to_onnx.py. This script loads the pretrained model, runs a forward pass to determine the expected input and output shapes, and saves the model to ONNX with those specifications.
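
A sketch of the kind of export such a script performs, assuming PyTorch and torchvision; the opset version, axis names, and output filename are illustrative:

    import torch
    import torchvision

    model = torchvision.models.resnet50(pretrained=True).eval()
    dummy = torch.randn(1, 3, 224, 224)          # trace input with the expected shape

    torch.onnx.export(
        model, dummy, "resnet50.onnx",
        input_names=["input"], output_names=["output"],
        dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
        opset_version=13,
    )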

Triton is then configured by placing the exported ONNX model in its model repository format at /model_repository. A configuration file defines the model name and expected inputs/outputs.

A Triton client implemented in …/triton_client.py handles connecting to the server, preparing input data for the model, running inference requests, and retrieving outputs. It loads an image, creates objects to represent data for the model, and prints predictions.

The README at …/README.md discusses optimizing the model with TensorRT, setting up Triton with the model, and using the Python client to query the model for predictions. It provides an end-to-end workflow guide.

TensorRT Optimization Samples

References: samples/sampleINT8API, samples/python/tensorflow_object_detection_api

The samples demonstrating optimizations like INT8 calibration focus on lowering the precision of models from FP32 to INT8 for deployment on NVIDIA hardware. This provides a significant speedup with minimal loss in accuracy through the use of calibration.

The …/sampleINT8API directory contains a C++ sample that builds an INT8 engine from an ONNX model. It demonstrates setting the INT8 configuration and running inference.

The Python sample in …/tensorflow_object_detection_api shows a complete workflow for object detection models. When building the INT8 engine, it supports INT8 calibration through command line arguments and loads a set of calibration images to drive the calibration passes.

Some key functionality related to the optimization samples (a configuration sketch follows the list):

  • Input preprocessing converts and normalizes images into the format the models expect
  • Engine building supports INT8 calibration driven by a set of calibration images
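
A hedged configuration sketch for enabling INT8 when building an engine; the calibrator argument stands in for an IInt8EntropyCalibrator2 implementation fed with real calibration images, which these samples provide:

    import tensorrt as trt

    def build_int8_engine(builder, network, calibrator):
        """Build a serialized engine with INT8 enabled, given a calibrator object."""
        config = builder.create_builder_config()
        config.set_flag(trt.BuilderFlag.INT8)
        config.int8_calibrator = calibrator        # drives the calibration passes
        # FP16 is commonly enabled too so unsupported layers can fall back to it.
        config.set_flag(trt.BuilderFlag.FP16)
        return builder.build_serialized_network(network, config)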

TensorRT Model Parsers

References: parsers

The TensorRT parsers subsystem implements parsers that can import models from frameworks like Caffe and build them into TensorRT engines. This allows using pre-trained models with TensorRT.

The parsers leverage several key components:

  • The …/caffe directory contains implementations for the Caffe parser.

  • Common utilities like data type definitions are provided in …/common.

  • Configuration and build rules are defined in …/CMakeLists.txt.

The Caffe parser workflow involves parsing prototxt files to extract the network structure and caffemodel files to extract weights, then returning a populated TensorRT network definition to the caller.

The parsers provide a consistent way to import models into TensorRT through common interfaces while handling differences in frameworks. This allows leveraging pre-trained models with TensorRT optimizations.

Caffe Model Parser

References: parsers/caffe, parsers/caffe/caffeParser, parsers/caffe/caffeWeightFactory

The …/caffe directory contains code for parsing Caffe models and converting them to TensorRT engines. It implements functionality to load and parse Caffe prototxt and protobuf files, extract network structure and layer configurations, load pretrained weights, and build equivalent TensorRT networks that can be used for inference.

The main components are:

  • The key parsing class is defined in …/caffeParser.h.

  • Lookup tables in …/opParsers map Caffe layers to parsing functions that initialize layers in TensorRT.

  • Individual layer parsing functions are defined in subdirectories like …/opParsers and extract layer attributes and weights.

The overall workflow is:

  1. The key parsing class loads models and begins parsing layers.

  2. Layer parsing functions are called based on the lookup tables.

  3. These functions initialize weights and extract layer configurations.

  4. Layers are added to build an equivalent TensorRT network definition.

Some important implementation details:

  • Lookup tables and individual layer parsers provide an extensible design.

  • Common utilities are defined in files like …/caffeMacros.h.

This parser enables working with Caffe models in TensorRT by mapping layers, weights, and attributes between frameworks during parsing.

UFF Model Parser

References: TensorRT

The …/common directory contains code for parsing models into TensorRT engines. It provides utilities used by parsers.

The parsing process works as follows:

  • The network structure is read into memory

  • Layers are converted to TensorRT plugin nodes

  • Weights and biases are deserialized

  • The parsed network is returned to be optimized and used for engine building

Key aspects of the implementation include:

  • Handling different layer types

  • Mapping layers to equivalent plugin types

  • Initializing plugin attributes from the layer config

This allows models exported from frameworks to be imported into TensorRT for optimization. The parser converts the topology and initializes parameters for the plugins.

Common Parser Utilities

References: parsers/common

The common parser utilities provide functionality that is reused across different parser implementations in TensorRT. At the core is the …/parserUtils.h header, which contains general purpose utilities to help with parsing models.

The header overloads stream operators so that dimensions and types can be pretty printed as strings for debugging. Additional helper functions work with representations of tensor dimensions, and a macro handles logging errors and returning early from functions.

The …/half.h header simplifies usage of half precision types by providing typedefs and handling compiler warnings.

Together these common parser utility headers provide generic functions, types, and helpers that underpin the implementation of individual model parsers in TensorRT. They handle common lower-level tasks like working with tensors and data types so the parsers can focus on higher-level model parsing logic.

TensorRT Docker Support

References: docker

The TensorRT Docker Support code provides tools for setting up development environments using Docker containers. This improves the workflow for contributors by ensuring a consistent environment.

The docker directory contains Dockerfiles, scripts, and other files that automate the setup of these environments. The Dockerfiles define images for common operating systems like Ubuntu and CentOS that will be used as bases. They install all necessary dependencies through package managers, configure environment variables, and create a non-root user for development activities.

The …/build.sh script handles building images from the Dockerfiles. It parses command line arguments to customize the build, such as specifying tags. Images can then be rebuilt easily if requirements change.

The …/launch.sh script launches containers from the built images. It mounts the local source code directory inside the container and sets runtime options like the number of GPUs to expose. This allows developing directly within the container using the same environment that builds will occur in.

The …/ubuntu-cross-aarch64.Dockerfile sets up a cross-compilation environment. It extracts prebuilt binary packages rather than compiling complex dependencies, and creates stub libraries so the target libraries can be linked against during builds. This allows compiling for other architectures without needing to build the target's dependencies locally.

Dockerfiles

References: docker/centos-7.Dockerfile, docker/ubuntu-18.04.Dockerfile, docker/ubuntu-20.04.Dockerfile

The Dockerfiles define Docker images that can be used to set up TensorRT development environments. The key Dockerfiles are:

  • …/centos-7.Dockerfile:

    • Installs CUDA, cuDNN and TensorRT from NVIDIA repositories
    • Adds a user for development
    • Sets environment variables and paths for TensorRT
  • …/ubuntu-18.04.Dockerfile:

    • Installs build dependencies and Python packages
    • Adds a non-root user and configures permissions
    • Sets environment variables to specify library and binary paths
  • …/ubuntu-20.04.Dockerfile:

    • Installs CUDA, cuDNN and other system libraries
    • Installs TensorRT packages for specified versions
    • Sets environment variables and the work directory /workspace

The Dockerfiles primarily install TensorRT and its dependencies using package managers. They also create a non-root user to isolate development activities. Key steps include setting environment variables and configuring paths. This allows developers to quickly start developing against TensorRT without installing libraries manually on each system.

Build and Launch Scripts

References: docker/build.sh, docker/launch.sh

The scripts …/build.sh and …/launch.sh are used to build and launch Docker containers for the TensorRT project.

…/build.sh parses command line arguments to determine the Dockerfile path and image name. It conditionally includes arguments during the build process. The script constructs a command string based on the parsed arguments and executes it to build the container.

…/launch.sh allows launching a TensorRT container with configurable options. It uses a while loop to parse arguments passed to the script, setting variables for things like the Docker image tag and number of GPUs. The script builds up a string conditionally based on the argument values. Options like mounting directories and setting the image name are added. The final command is printed and executed to launch the container.

Both scripts provide an interface for building and launching TensorRT containers with different configurations in a parameterized way based on command line inputs. They handle tasks like argument parsing, command string construction, and executing commands without relying on external libraries.

Cross Compilation Support

References: docker/ubuntu-cross-aarch64.Dockerfile

The …/ubuntu-cross-aarch64.Dockerfile sets up a cross-compilation environment for building TensorRT and related libraries for aarch64 targets such as NVIDIA's embedded and automotive platforms. It extracts prebuilt binary packages for CUDA, cuDNN and TensorRT into the /pdk_files/ directory rather than compiling from source. This avoids needing to cross-compile complex dependencies.

It then sets symlinks from the target header file locations under /pdk_files/ to the host include paths. For example, it symlinks /pdk_files/cudnn/usr/include/aarch64-linux-gnu to /usr/include.

To allow linking against the target libraries during compilation, it creates stub libraries that export only empty symbols, for example a stub for libnvinfer.so under /pdk_files/tensorrt/lib/stubs/.

This allows the cross-compiler to link without errors while the real implementation is provided at runtime on the target system.

The environment is set up to use these target library paths and the build output directory for any builds initiated in the container. This allows compiling applications and libraries for ARM64 using the cross-compiler while still having access to the target library headers and stubs.

TensorRT Demo Applications

References: demo

This section demonstrates end-to-end deep learning applications built with TensorRT. Key demos include:

  • Object Detection: The …/EfficientDet directory contains sample code showing EfficientDet usage for detection.

  • Speech Recognition: The …/Jasper directory contains an example of accelerating a pre-trained model for low-latency speech recognition.

  • Text-to-Speech: The …/tensorrt directory contains code for converting the Tacotron 2 and WaveGlow PyTorch models to TensorRT engines for text-to-speech synthesis.

  • Image Generation: The …/Diffusion directory contains pipelines for image generation with diffusion models, covering text-to-image, image-to-image, and inpainting.

  • Language Models: The …/HuggingFace directory contains demos for transformer language models built on the NNDF (Neural Network Driven Framework) utilities.

Image Classification and Object Detection Models

References: demo/EfficientDet

The demos in this section showcase popular models for image classification and object detection that have been optimized for TensorRT. The …/EfficientDet directory contains Jupyter notebooks and Python files that demonstrate end-to-end usage of EfficientDet models for object detection.

The notebooks in …/notebooks load a pretrained Keras EfficientDet model, convert it into a TensorRT engine, and run inference on sample images, walking through each step of the workflow.

Overall, these demos provide a complete workflow for using popular computer vision models optimized with TensorRT, from loading models to running inference. They allow users to leverage pre-trained models for classification and detection tasks with high performance.

Speech Recognition Models

References: demo/Jasper, demo/Tacotron2

This section discusses demos for speech recognition models like Jasper, Tacotron 2, and WaveGlow contained in the demo directory. The …/Jasper directory contains code demonstrating low-latency speech recognition inference using a pre-trained Jasper model. A class handles loading an optimized TensorRT engine generated from the PyTorch Jasper model. This class provides methods to preprocess input audio by extracting log mel filterbank features and normalizing the data. The /notebooks subdirectory contains a Jupyter notebook that loads a Jasper model checkpoint, converts it to an engine, and runs inference on sample audio to recognize speech in real-time.

The …/Tacotron2 directory contains an end-to-end text-to-speech system that uses Tacotron 2 for mel spectrogram generation from text and WaveGlow for waveform synthesis from mel spectrograms. A driver function orchestrates running the full TTS pipeline on sample text by tokenizing the input and running the models in sequence. The /tensorrt subdirectory contains code to convert the PyTorch models to ONNX and generate TensorRT engines to accelerate inference.

Generative and GAN Models

References: TensorRT

The …/Diffusion directory contains demos for accelerating diffusion models such as Stable Diffusion using TensorRT. It provides examples of text-to-image generation, image-to-image translation, and image inpainting with diffusion models.

Classes in …/models.py handle exporting models between formats by inheriting from a base export class.

Jupyter notebooks demonstrate full workflows like converting PyTorch diffusion models to TensorRT engines. The demo entry point scripts load the pipeline classes to run the requested model or task.

Pipelines for tasks such as text-to-image generation are implemented in classes under directories like …/txt2img_pipeline.py. These encapsulate end-to-end workflows.

Model classes in …/models.py define interfaces for the individual diffusion model components and their TensorRT engines; optimization profiles for dynamic input shapes are defined per model.

Python demo scripts in the top level directory implement complete pipelines for each task, abstracting away TensorRT details. This accelerates diffusion model deployment while maintaining usability.