NVIDIA TensorRT for Inference Optimization Cheat Sheet

Updated 2026-05-21


NVIDIA TensorRT is a high-performance deep learning inference SDK that compiles trained neural networks into optimized inference engines for NVIDIA GPUs. It sits at the intersection of model deployment and hardware-level execution, transforming framework-trained models (typically exported as ONNX) into GPU-specific execution plans that exploit layer fusion, precision calibration, and kernel auto-tuning. The core value proposition is higher throughput and lower latency: a model that runs at 100 ms in PyTorch may execute in under 10 ms after TensorRT compilation, although the actual gain depends on the model, the precision, and the GPU. The critical mental model is that TensorRT builds once and runs many times. The expensive tactic selection and kernel benchmarking happen at engine build time, so the resulting serialized engine is a highly specialized binary artifact tied to a specific GPU architecture, TensorRT version, and input shape range.
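Because a serialized engine is tied to the GPU, the TensorRT version, and the shape range it was built for, a build cache is usually keyed on all three. The following is a minimal sketch of that idea, not an official TensorRT utility: `engine_cache_path`, the `engines` directory, and the model and GPU names are hypothetical, and the key uses the ONNX file path rather than its contents.

```python
import hashlib
from pathlib import Path


def engine_cache_path(onnx_path, gpu_name, trt_version, shape_range, cache_dir="engines"):
    """Return a cache file path that changes whenever any build input changes.

    Hypothetical helper: the key combines the model path, GPU name,
    TensorRT version, and the (min, max) input shape range.
    """
    key = "|".join([str(onnx_path), gpu_name, trt_version, repr(shape_range)])
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return Path(cache_dir) / f"{Path(onnx_path).stem}-{digest}.engine"


# Illustrative values only; in practice they would come from the running system.
shapes = ((1, 3, 224, 224), (8, 3, 224, 224))
a100_engine = engine_cache_path("resnet50.onnx", "NVIDIA A100", "10.0.1", shapes)
l4_engine = engine_cache_path("resnet50.onnx", "NVIDIA L4", "10.0.1", shapes)

# The same model on a different GPU maps to a different engine file,
# so an engine built for one GPU is never reused on the other.
print(a100_engine != l4_engine)
```

A production cache would more likely hash the ONNX file contents and the build flags as well, so that retraining or changing precision also invalidates the stored engine.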

What This Cheat Sheet Covers

This topic spans 17 focused tables and 131 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Core Build-Phase API Classes
Table 2: ONNX-to-TensorRT Conversion Workflow
Table 3: Precision Modes and BuilderFlags
Table 4: Optimization Profiles and Dynamic Shapes
Table 5: Layer Fusion and Graph Optimizations
Table 6: Kernel Auto-Tuning and Tactic Selection
Table 7: INT8 Calibration
Table 8: Explicit Quantization with Q/DQ Layers
Table 9: Runtime Execution API
Table 10: Engine Serialization and Refit
Table 11: CUDA Graphs and Multi-Stream Execution
Table 12: Profiling and Performance Analysis
Table 13: Custom Plugins (IPluginV3)
Table 14: TensorRT-LLM for Transformer Inference
Table 15: Advanced BuilderConfig Options and Hardware Compatibility
Table 16: NVIDIA Model Optimizer and Sparsity
Table 17: Torch-TensorRT and Framework Integration

Table 1: Core Build-Phase API Classes

The three objects you construct every time you build a TensorRT engine are the Builder, the Network, and the BuilderConfig. Understanding each class's responsibility — and the order in which they are used — is the minimum prerequisite for any TensorRT workflow.

Class	Example	Description
IBuilder	`builder = trt.Builder(logger)` `engine_bytes = builder.build_serialized_network(network, config)`	• Top-level factory that produces the serialized engine • also creates the Network and Config objects via `create_network()` and `create_builder_config()`.
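INetworkDefinition	`network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))`	• Graph container where layers and tensors are defined • in TensorRT 10+ every network is explicit-batch (the `EXPLICIT_BATCH` flag is deprecated and has no effect), so `STRONGLY_TYPED` is the creation flag that still matters • a strongly typed network takes tensor types from the graph itself, so precision flags such as `FP16` do not apply to it.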
IBuilderConfig	`config = builder.create_builder_config()` `config.set_flag(trt.BuilderFlag.FP16)`	Controls all build-time knobs: precision flags, timing cache, workspace memory, auxiliary streams, optimization level, and hardware compatibility.
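The three classes in the table are typically used in the order Builder, Network, Config. The sketch below shows one way that sequence might look, assuming the TensorRT 10 Python package and an ONNX model on disk. The function name `build_engine`, its parameters, and the 1 GiB workspace default are illustrative choices, not part of the TensorRT API. The network here is created without `STRONGLY_TYPED` so that the `FP16` builder flag takes effect.

```python
GIB = 1 << 30  # bytes in one gibibyte


def build_engine(onnx_path, engine_path, workspace_gib=1):
    """Build a serialized TensorRT engine from an ONNX file (hypothetical helper)."""
    # Imported lazily so this module can be loaded on machines without TensorRT.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)

    # Flags value 0: assumes TensorRT 10, where networks are always explicit-batch.
    network = builder.create_network(0)

    # Populate the network from the ONNX file and surface parser errors.
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parse failed:\n" + "\n".join(errors))

    # Build-time knobs live on the config, not on the builder or the network.
    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, workspace_gib * GIB)

    # The builder returns None on failure instead of raising.
    engine_bytes = builder.build_serialized_network(network, config)
    if engine_bytes is None:
        raise RuntimeError("Engine build failed; see the logger output.")

    with open(engine_path, "wb") as f:
        f.write(engine_bytes)
    return engine_path
```

A call such as `build_engine("model.onnx", "model.engine")` would need an NVIDIA GPU and a working TensorRT installation; both file names are placeholders.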
