NVIDIA TensorRT

NVIDIA? TensorRT? is an ecosystem of tools for developers to achieve high-performance deep learning inference. TensorRT includes inference compilers, runtimes, and model optimizations that deliver low latency and high throughput for production applications. The TensorRT ecosystem includes the TensorRT compiler, TensorRT-LLM, TensorRT Model Optimizer, and TensorRT Cloud.

Download Now Documentation
Forum

How TensorRT Works

Speed up inference by 36X compared to CPU-only platforms.

Built on the NVIDIA? CUDA? parallel programming model, TensorRT includes libraries that optimize neural network models trained on all major frameworks, calibrate them for lower precision with high accuracy, and deploy them to hyperscale data centers, workstations, laptops, and edge devices. TensorRT optimizes inference using techniques such as quantization, layer and tensor fusion, and kernel tuning.

TensorRT provides post-training quantization and support for models trained with quantization-aware training techniques for optimizing FP8, FP4, and integer formats for deep learning inference. Reduced-precision inference significantly minimizes latency, which is required for many real-time services, as well as autonomous and embedded applications.

Read the Introductory TensorRT Blog

Learn how to apply TensorRT optimizations and deploy a PyTorch model to GPUs.

Read Blog

Watch On-Demand TensorRT Sessions From GTC

Learn more about TensorRT and its features from a curated list of webinars at GTC.

Watch Sessions

Get the Complete Developer Guide

See how to get started with TensorRT in this step-by-step developer and API reference guide.

Read Guide

Navigate AI infrastructure and Performance

Learn how to lower your cost per token and get the most out of your AI models with our ebook.

View eBook

Key Features

Large Language Model Inference

NVIDIA TensorRT-LLM is an open-source library that accelerates and optimizes inference performance of large language models (LLMs) on the NVIDIA AI platform with a simplified Python API.

Developers accelerate LLM performance on NVIDIA GPUs in the data center or on workstation GPUsâ€”including NVIDIA RTX? systems on native Windowsâ€”with the same seamless workflow.

Compile in the Cloud

NVIDIA TensorRT Cloud is a developer-focused service for generating hyper-optimized engines for given constraints and KPIs. Given an LLM and inference throughput/latency requirements, a developer can invoke TensorRT Cloud service using a command-line interface to hyper-optimize a TensorRT-LLM engine for a target GPU. The cloud service will automatically determine the best engine configuration that meets the requirements. Developers can also use the service to build optimized TensorRT engines from ONNX models on a variety of NVIDIA RTX, GeForce, Quadro?, or Tesla?-class GPUs. TensorRT Cloud is available in limited access to select partners. Apply for access, subject to approval.

Optimize Neural Networks

NVIDIA TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques, including quantization, sparsity, and distillation. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM and TensorRT to efficiently optimize inference on NVIDIA GPUs.

Major Framework Integrations

TensorRT integrates directly into PyTorch and Hugging Face to achieve 6X faster inference with a single line of code. TensorRT provides an ONNX parser to import ONNX models from popular frameworks into TensorRT. MATLAB is integrated with TensorRT through GPU Coder to automatically generate high-performance inference engines for NVIDIA Jetson?, NVIDIA DRIVE?, and data center platforms.

Deploy, Run, and Scale With Triton

TensorRT-optimized models are deployed, run, and scaled with NVIDIA Triton? inference-serving software that includes TensorRT as a backend. The advantages of using Triton include high throughput with dynamic batching, concurrent model execution, model ensembling, and streaming audio and video inputs.

Accelerate Every Inference Platform

TensorRT can optimize models for applications across the edge, laptops and desktops, and data centers. It powers key NVIDIA solutionsâ€”such as NVIDIA TAO, NVIDIA DRIVE, NVIDIA Clara?, and NVIDIA JetPack?â€”and is integrated with application-specific SDKs, such as NVIDIA NIM?, NVIDIA DeepStream, NVIDIA? Riva, NVIDIA Merlin?, NVIDIA Maxine?, NVIDIA Morpheus, and NVIDIA Broadcast Engine.

TensorRT provides developers a unified path to deploy intelligent video analytics, speech AI, recommender systems, video conferencing, AI-based cybersecurity, and streaming apps in production.

Get Started With TensorRT

TensorRT is an ecosystem of APIs for high-performance deep learning inference.

Download TensorRT

The TensorRT inference library provides a general-purpose AI compiler and an inference runtime that delivers low latency and high throughput for production applications.

Download SDK

Download Container

Documentation

Download TRT-LLM

TensorRT-LLM is available for free on GitHub.

Download (GitHub)

Documentation

Download TensorRT Model Optimizer

TensorRT Model Optimizer is available for free on NVIDIA PyPI, with examples and recipes on GitHub.

Download (GitHub)

Documentation

Get Started With TensorRT Frameworks

TensorRT Frameworks add TensorRT compiler functionality to frameworks like PyTorch.

Download ONNX and Torch-TensorRT

The TensorRT inference library provides a general-purpose AI compiler and an inference runtime that delivers low latency and high throughput for production applications.

ONYX:

Documentation

Torch-TensorRT:

Download Container

Documentation

Experience Tripy: Pythonic Inference With TensorRT

Experience high-performance inference and excellent usability with Tripy. Expect intuitive APIs, easy debugging with eager mode, clear error messages, and top-notch documentation to streamline your deep learning deployment.

Documentation

Examples

Contribute

Deploy

Get a free license to try NVIDIA AI Enterprise in production for 90 days using your existing infrastructure.

Request a 90-Day License

World-Leading Inference Performance

TensorRT was behind NVIDIAâ€™s wins across all inference performance tests in the industry-standard benchmark for MLPerf Inference. TensorRT-LLM accelerates the latest large language models for generative AI, delivering up to 8X more performance, 5.3X better TCO, and nearly 6X lower energy consumption.

See All Benchmarks

8X Increase in GPT-J 6B Inference Performance

4X Higher Llama2 Inference Performance

Total Cost of Ownership

Lower is better

Energy Use

Lower is better

Starter Kits

TensorRT Ecosystem Ecosystem

Widely Adopted Across Industries

More Resources

Explore the Community

Get Training and Certification

Read Top Stories and Blogs

Ethical AI

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here .

Get started with TensorRT today, and use the right inference tools to develop AI for any application on any platform.

Download Now

NVIDIA TensorRT

How TensorRT Works

Read the Introductory TensorRT Blog

Watch On-Demand TensorRT Sessions From GTC

Get the Complete Developer Guide

Navigate AI infrastructure and Performance

Key Features

Large Language Model Inference

Compile in the Cloud

Optimize Neural Networks

Major Framework Integrations

Deploy, Run, and Scale With Triton

Accelerate Every Inference Platform

Get Started With TensorRT

Download TensorRT

Download TRT-LLM

Download TensorRT Model Optimizer

Get Started With TensorRT Frameworks

Download ONNX and Torch-TensorRT

Experience Tripy: Pythonic Inference With TensorRT

Deploy

World-Leading Inference Performance

8X Increase in GPT-J 6B Inference Performance

4X Higher Llama2 Inference Performance

Total Cost of Ownership

Energy Use

Starter Kits

Beginner Guide to TensorRT

Beginner Guide to TensorRT-LLM

Beginner Guide to TensorRT Model Optimizer

Beginner Guide to Torch-TensorRT

Beginner Guide to TensorRT Pythonic Frontend: Tripy

TensorRT Ecosystem Ecosystem

More Resources

Explore the Community

Get Training and Certification

Read Top Stories and Blogs

Ethical AI