NVIDIA TensorRT is a high-performance deep learning inference SDK that compiles trained neural networks into optimized inference engines for NVIDIA GPUs. It sits at the intersection of model deployment and hardware-level execution, transforming framework-trained models (typically exported as ONNX) into GPU-specific execution plans that exploit layer fusion, precision calibration, and kernel auto-tuning. The core value proposition is throughput and latency reduction: a model that runs at 100 ms in PyTorch may execute in under 10 ms after TensorRT compilation. The critical mental model is that TensorRT builds once and runs many times: the expensive tactic selection and kernel benchmarking happen at engine build time, so the resulting serialized engine is a highly specialized binary artifact tied to a specific GPU architecture, TensorRT version, and input shape range.
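The build-once, run-many model can be sketched as a small on-disk cache. This is an illustration only, not part of TensorRT: `engine_cache_path` and `load_or_build` are hypothetical helper names, and the cache key simply combines the things the paragraph above says an engine is tied to (GPU, TensorRT version, input shape range). It assumes the caller supplies those values as strings and passes a `build_fn` that performs the real build.

```python
import hashlib
from pathlib import Path
from typing import Callable


def engine_cache_path(cache_dir: str, onnx_path: str, gpu_name: str,
                      trt_version: str, shape_profile: str) -> Path:
    """Hypothetical helper: derive a file name from what the engine is tied to."""
    # Simplification: the key uses the model path, not a hash of its contents.
    key = "|".join([onnx_path, gpu_name, trt_version, shape_profile])
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return Path(cache_dir) / f"{Path(onnx_path).stem}-{digest}.engine"


def load_or_build(path: Path, build_fn: Callable[[], bytes]) -> bytes:
    """Hypothetical helper: return cached engine bytes, building only on a miss."""
    if path.exists():
        return path.read_bytes()
    engine_bytes = build_fn()  # the slow step: tactic selection and kernel timing
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(engine_bytes)
    return engine_bytes
```

At run time, the returned bytes would typically be passed to `trt.Runtime(logger).deserialize_cuda_engine(...)`. Changing the GPU, the TensorRT version, or the shape range produces a different cache path, so a stale engine is never reused.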
What This Cheat Sheet Covers
This topic spans 17 focused tables and 131 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.
Table 1: Core Build-Phase API Classes
The three objects you construct every time you build a TensorRT engine are the Builder, the Network, and the BuilderConfig. Understanding each class's responsibility, and the order in which they are used, is the minimum prerequisite for any TensorRT workflow.
| Class | Example | Description |
|---|---|---|
| `trt.Builder` | `builder = trt.Builder(logger)`<br>`engine_bytes = builder.build_serialized_network(network, config)` | Top-level factory that produces the serialized engine; also creates the Network and Config objects via `create_network()` and `create_builder_config()`. |
| `trt.INetworkDefinition` | `network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))` | Graph container where layers and tensors are defined. In TensorRT 10+ networks are always explicit-batch, so the `EXPLICIT_BATCH` flag is deprecated and has no effect; `STRONGLY_TYPED` is optional and makes the builder follow the data types declared in the network. |
| `trt.IBuilderConfig` | `config = builder.create_builder_config()`<br>`config.set_flag(trt.BuilderFlag.FP16)` | Controls all build-time knobs: precision flags, timing cache, workspace memory, auxiliary streams, optimization level, and hardware compatibility. Precision flags such as `FP16` apply to weakly typed networks and are not permitted together with `STRONGLY_TYPED`. |
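The order in which the three objects are used can be sketched end to end as follows. This is a minimal sketch rather than a reference implementation. It assumes the TensorRT 10 Python bindings, an ONNX model on disk, and a weakly typed network (creation flags `0`) so that the `FP16` builder flag is allowed. The function name `build_serialized_engine` and the 1 GiB workspace default are illustrative choices.

```python
def build_serialized_engine(onnx_path: str, workspace_bytes: int = 1 << 30) -> bytes:
    """Illustrative helper: Builder -> Network -> BuilderConfig -> engine bytes."""
    # Imported here so this module can be loaded on machines without TensorRT.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)

    # Assumption: TensorRT 10, where networks are explicit-batch by default.
    # Flags = 0 gives a weakly typed network, which accepts precision flags.
    network = builder.create_network(0)

    # Populate the network from an ONNX file and surface parser errors.
    parser = trt.OnnxParser(network, logger)
    if not parser.parse_from_file(onnx_path):
        errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
        raise RuntimeError("ONNX parse failed:\n" + "\n".join(errors))

    # Build-time knobs live on the config object.
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, workspace_bytes)
    config.set_flag(trt.BuilderFlag.FP16)

    # The expensive step: tactic selection and kernel timing happen here.
    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("engine build failed; check the logger output")
    return bytes(serialized)
```

The returned bytes can be written to disk and reloaded later, which keeps the slow build step out of the serving path.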