NVIDIA TensorRT
NVIDIA TensorRT is an ecosystem of APIs for high-performance deep learning inference. TensorRT includes an inference runtime and model optimizations that deliver low latency and high throughput for production applications. The TensorRT ecosystem includes TensorRT, TensorRT-LLM, TensorRT Model Optimizer, and TensorRT Cloud.
NVIDIA TensorRT Benefits
Speed Up Inference by 36X
NVIDIA TensorRT-based applications perform up to 36X faster than CPU-only platforms during inference. TensorRT optimizes neural network models trained on all major frameworks, calibrates them for lower precision with high accuracy, and deploys them to hyperscale data centers, workstations, laptops, and edge devices.
Optimize Inference Performance
TensorRT, built on the CUDA parallel programming model, optimizes inference using techniques such as quantization, layer and tensor fusion, and kernel tuning on all types of NVIDIA GPUs, from edge devices to PCs to data centers.
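To make "layer fusion" concrete, here is a minimal sketch in plain Python. It folds a batch-normalization step into the scalar linear layer that precedes it, so two operations collapse into one. This only illustrates the general idea and is not TensorRT's implementation. The function names and the sample values are hypothetical.

```python
import math

def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold batch norm into the preceding linear layer y = weight * x + bias.

    Returns (fused_weight, fused_bias) such that a single affine op
    reproduces linear followed by batch norm.
    """
    scale = gamma / math.sqrt(var + eps)
    return weight * scale, (bias - mean) * scale + beta

def unfused(x, weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Reference path: linear layer, then batch norm, as two separate steps."""
    y = weight * x + bias
    return gamma * (y - mean) / math.sqrt(var + eps) + beta

# Hypothetical parameters for a single scalar channel.
w, b = 0.8, 0.1
gamma, beta, mean, var = 1.5, -0.2, 0.3, 4.0

fused_w, fused_b = fold_batchnorm(w, b, gamma, beta, mean, var)
x = 2.0
reference = unfused(x, w, b, gamma, beta, mean, var)
fused = fused_w * x + fused_b  # one multiply-add instead of two layers
```

The fused path gives the same result as the two-step path while doing less work per input, which is the kind of saving fusion aims for.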
Accelerate Every Workload
TensorRT provides post-training quantization and quantization-aware training techniques for running deep learning inference at FP8, INT8, and INT4 precision. Reduced-precision inference significantly lowers latency, which many real-time services, as well as autonomous and embedded applications, require.
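As a rough illustration of reduced-precision inference, the sketch below applies symmetric per-tensor INT8 quantization to a short list of weights in plain Python. It is a simplified assumption about how such a scheme can work, not TensorRT's calibration algorithm. The function names and sample weights are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the INT8 codes."""
    return [q * scale for q in quantized]

# Hypothetical weights; real tensors would come from a trained model.
weights = [0.02, -1.27, 0.635, 0.9, -0.4]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

# Rounding error is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each value is stored in 8 bits instead of 32, at the cost of a rounding error of at most half the scale. In practice, scales are chosen from representative calibration data rather than from a single maximum.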
Deploy, Run, and Scale With Triton
TensorRT-optimized models are deployed, run, and scaled with NVIDIA Triton inference-serving software that includes TensorRT as a backend. The advantages of using Triton include high throughput with dynamic batching, concurrent model execution, model ensembling, and streaming audio and video inputs.
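The idea behind dynamic batching can be sketched in a few lines of plain Python: requests that arrive close together are grouped into one batch, which closes when it is full or when the oldest request has waited long enough. This is a simplified offline model and not Triton's scheduler. The function name, parameters, and sample requests are hypothetical.

```python
from collections import deque

def form_batches(requests, max_batch_size, max_delay):
    """Group (arrival_time, payload) requests into batches.

    A batch closes when it reaches max_batch_size or when the next
    request arrived more than max_delay after the batch's first request.
    """
    queue = deque(sorted(requests))
    batches = []
    while queue:
        first_time, first_payload = queue.popleft()
        batch = [first_payload]
        while (queue and len(batch) < max_batch_size
               and queue[0][0] - first_time <= max_delay):
            batch.append(queue.popleft()[1])
        batches.append(batch)
    return batches

# Hypothetical request stream: (arrival time in ms, request id).
requests = [(0.0, "a"), (0.5, "b"), (0.9, "c"), (5.0, "d"), (5.2, "e")]
batches = form_batches(requests, max_batch_size=2, max_delay=1.0)
```

Larger batches generally improve GPU utilization, while the delay bound keeps any single request from waiting too long.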
Explore the Features and Tools of NVIDIA TensorRT
Large Language Model Inference
NVIDIA TensorRT-LLM is an open-source library that accelerates and optimizes inference performance of recent large language models (LLMs) on the NVIDIA AI platform. Developers can experiment with new LLMs for high performance and quick customization with a simplified Python API.
Developers can accelerate LLM performance on NVIDIA GPUs in the data center or on workstation GPUs (including NVIDIA RTX systems on native Windows) with the same seamless workflow.
Optimized Inference Engines
NVIDIA TensorRT Cloud is a developer service for compiling and creating optimized inference engines for ONNX models. Developers can use their own model and choose the target RTX GPU. TensorRT Cloud then builds the optimized inference engine, which can be downloaded and integrated into an application. TensorRT Cloud also provides prebuilt, optimized engines for popular LLMs on RTX GPUs.
TensorRT Cloud is available in early access on NVIDIA GeForce RTX GPUs to select partners. Apply to be notified when it's publicly available.
Optimize Neural Networks
NVIDIA TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques, including quantization, sparsity, and distillation. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM and TensorRT to efficiently optimize inference on NVIDIA GPUs.
Major Framework Integrations
TensorRT integrates directly into PyTorch, Hugging Face, and TensorFlow to achieve 6X faster inference with a single line of code. TensorRT provides an ONNX parser to import ONNX models from popular frameworks into TensorRT. MATLAB is integrated with TensorRT through GPU Coder to automatically generate high-performance inference engines for NVIDIA Jetson, NVIDIA DRIVE, and data center platforms.
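The ONNX import path can be sketched with the TensorRT Python API as follows. This is a minimal sketch that assumes a TensorRT 8.x-style Python package is installed on a machine with a supported GPU. Some of these calls, such as the explicit-batch flag and the FP16 builder flag, may be deprecated or changed in newer releases. The function name `build_engine_from_onnx` is hypothetical.

```python
def build_engine_from_onnx(onnx_path, use_fp16=True):
    """Parse an ONNX file and return a serialized TensorRT engine."""
    # Imported lazily so this sketch loads even where TensorRT is absent.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    # Assumption: explicit-batch flag as used by TensorRT 8.x releases.
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)
    parser = trt.OnnxParser(network, logger)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parse failed: " + "; ".join(errors))

    config = builder.create_builder_config()
    if use_fp16:
        # Allow the builder to pick FP16 kernels where it judges them faster.
        config.set_flag(trt.BuilderFlag.FP16)

    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("TensorRT engine build failed")
    return serialized
```

A call such as `build_engine_from_onnx("model.onnx")`, where `model.onnx` is a placeholder path, would return engine data that can be written to disk and later loaded by the TensorRT runtime.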
World-Leading Inference Performance
TensorRT was behind NVIDIA's wins across all performance tests in the industry-standard MLPerf Inference benchmark. TensorRT-LLM accelerates the latest large language models for generative AI, delivering up to 8X more performance, 5.3X better TCO, and nearly 6X lower energy consumption.
8X Increase in GPT-J 6B Inference Performance
4X Higher Llama2 Inference Performance
Total Cost of Ownership
Energy Use
Accelerate Every Inference Platform
TensorRT can optimize AI deep learning models for applications across the edge, laptops and desktops, and data centers. It powers key NVIDIA solutions, such as NVIDIA TAO, NVIDIA DRIVE, NVIDIA Clara, and NVIDIA JetPack.
TensorRT is also integrated with application-specific SDKs, such as NVIDIA NIM, NVIDIA DeepStream, NVIDIA Riva, NVIDIA Merlin, NVIDIA Maxine, NVIDIA Morpheus, and NVIDIA Broadcast Engine. TensorRT provides developers a unified path to deploy intelligent video analytics, speech AI, recommender systems, video conferencing, AI-based cybersecurity, and streaming apps in production.
From creator apps to games and productivity tools, TensorRT is embraced by millions of NVIDIA RTX, GeForce, and Quadro GPU users. Whether integrated directly or via the ONNX-Runtime framework, TensorRT-optimized engines are weightless and compressed, empowering developers to incorporate AI-rich features without bloating app sizes.
Read Success Stories
Amazon
Discover how Amazon improved customer satisfaction by accelerating its inference by 5X.
American Express
American Express improves fraud detection by analyzing tens of millions of daily transactions 50X faster. Find out how.
Zoox
Explore how Zoox, a robotaxi startup, accelerated their perception stack by 19X using TensorRT for real-time inference on autonomous vehicles.
Widely Adopted Across Industries
TensorRT Resources
Read the Introductory TensorRT Blog
Learn how to apply TensorRT optimizations and deploy a PyTorch model to GPUs.
Watch On-Demand TensorRT Sessions From GTC
Learn more about TensorRT and its features from a curated list of webinars at GTC.
Get the Introductory Developer Guide
See how to get started with TensorRT in this step-by-step developer and API reference guide.
Use the right inference tools to develop AI for any application on any platform.