• <xmp id="om0om">

    NVIDIA TensorRT

    NVIDIA TensorRT is an ecosystem of APIs for high-performance deep learning inference. TensorRT includes an inference runtime and model optimizations that deliver low latency and high throughput for production applications. The TensorRT ecosystem includes TensorRT, TensorRT-LLM, TensorRT Model Optimizer, and TensorRT Cloud.



    NVIDIA TensorRT Benefits

    Speed Up Inference by 36X

    NVIDIA TensorRT-based applications perform up to 36X faster than CPU-only platforms during inference. TensorRT optimizes neural network models trained on all major frameworks, calibrates them for lower precision with high accuracy, and deploys them to hyperscale data centers, workstations, laptops, and edge devices.

    Optimize Inference Performance

    TensorRT, built on the CUDA parallel programming model, optimizes inference using techniques such as quantization, layer and tensor fusion, and kernel tuning on all types of NVIDIA GPUs, from edge devices to PCs to data centers.
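As a rough illustration of what layer fusion means in general, the sketch below folds a batch-normalization step into the linear layer that precedes it, so that two operations become one. This is a minimal pure-Python sketch of the textbook technique, not TensorRT's implementation; the function names (`linear`, `batch_norm`, `fuse_linear_bn`) and all numeric values are hypothetical.

```python
import math


def linear(weights, bias, x):
    """Compute y = W x + b for a small dense layer."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Apply inference-time batch normalization per output channel."""
    return [g * (v - m) / math.sqrt(s + eps) + bt
            for v, g, bt, m, s in zip(y, gamma, beta, mean, var)]


def fuse_linear_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm statistics into the linear layer's weights and bias."""
    fused_w, fused_b = [], []
    for row, b, g, bt, m, s in zip(weights, bias, gamma, beta, mean, var):
        k = g / math.sqrt(s + eps)          # per-channel scale factor
        fused_w.append([k * w for w in row])
        fused_b.append(k * (b - m) + bt)
    return fused_w, fused_b


# Hypothetical layer parameters, chosen only for illustration.
W = [[1.0, 2.0], [0.5, -1.0]]
b = [0.1, -0.2]
gamma, beta = [1.5, 0.8], [0.0, 0.3]
mean, var = [0.4, -0.1], [0.25, 1.0]
x = [2.0, -1.0]

unfused = batch_norm(linear(W, b, x), gamma, beta, mean, var)
fused_W, fused_b = fuse_linear_bn(W, b, gamma, beta, mean, var)
fused = linear(fused_W, fused_b, x)
print(unfused, fused)
```

The fused layer produces the same outputs as the two-step version while doing a single pass over the data, which is the basic reason fusion can reduce memory traffic and kernel launches.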

    Accelerate Every Workload

    TensorRT provides post-training quantization and quantization-aware training techniques for optimizing models to FP8, INT8, and INT4 precision for deep learning inference. Reduced-precision inference significantly reduces latency, which is required for many real-time services, as well as autonomous and embedded applications.
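To make the idea of reduced precision concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization, one common post-training scheme. It is an assumption-laden illustration in pure Python and does not reproduce TensorRT's calibration algorithms; the helper names and the sample weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any positive scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.02, -1.27, 0.64, 0.0, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each value is stored in 8 bits instead of 32, and the round-trip error stays within half a quantization step, which is the trade-off reduced-precision inference relies on.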

    Deploy, Run, and Scale With Triton

    TensorRT-optimized models are deployed, run, and scaled with NVIDIA Triton inference-serving software, which includes TensorRT as a backend. The advantages of using Triton include high throughput with dynamic batching, concurrent model execution, model ensembling, and streaming audio and video inputs.
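The general idea behind dynamic batching can be sketched as follows: individual requests are grouped into a batch until either the batch is full or the oldest queued request has waited too long. This is a simplified pure-Python model, not Triton's scheduler; `form_batches`, its parameters, and the arrival times are hypothetical.

```python
def form_batches(arrivals_us, max_batch_size, max_delay_us):
    """Group sorted request arrival times (microseconds) into batches."""
    batches, current = [], []
    for t in arrivals_us:
        # Flush if the oldest queued request would exceed the allowed delay.
        if current and t - current[0] > max_delay_us:
            batches.append(current)
            current = []
        current.append(t)
        # Flush as soon as the batch reaches its maximum size.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


# Hypothetical arrival times: a short burst, a gap, then a longer burst.
batches = form_batches([0, 10, 20, 500, 510, 520, 530, 540],
                       max_batch_size=4, max_delay_us=100)
print(batches)
```

Larger batches generally improve GPU throughput, while the delay limit bounds how much latency any single request pays for waiting.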


    Explore the Features and Tools of NVIDIA TensorRT

    Large Language Model Inference

    NVIDIA TensorRT-LLM is an open-source library that accelerates and optimizes inference performance of recent large language models (LLMs) on the NVIDIA AI platform. Developers can experiment with new LLMs through a simplified Python API that offers high performance and quick customization.

    Developers can accelerate LLM performance on NVIDIA GPUs in the data center or on workstation GPUs, including NVIDIA RTX systems on native Windows, with the same seamless workflow.

    Optimized Inference Engines

    NVIDIA TensorRT Cloud is a developer service for compiling and creating optimized inference engines for ONNX models. Developers can supply their own model and choose the target RTX GPU. TensorRT Cloud then builds the optimized inference engine, which can be downloaded and integrated into an application. TensorRT Cloud also provides prebuilt, optimized engines for popular LLMs on RTX GPUs.

    TensorRT Cloud is available in early access on NVIDIA GeForce RTX GPUs to select partners. Apply to be notified when it's publicly available.

    Optimize Neural Networks

    NVIDIA TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques, including quantization, sparsity, and distillation. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM and TensorRT to efficiently optimize inference on NVIDIA GPUs.

    Major Framework Integrations

    TensorRT integrates directly into PyTorch, Hugging Face, and TensorFlow to achieve 6X faster inference with a single line of code. TensorRT provides an ONNX parser to import ONNX models from popular frameworks into TensorRT. MATLAB is integrated with TensorRT through GPU Coder to automatically generate high-performance inference engines for NVIDIA Jetson, NVIDIA DRIVE, and data center platforms.


    World-Leading Inference Performance

    TensorRT was behind NVIDIA’s wins across all performance tests in MLPerf Inference, the industry-standard benchmark. TensorRT-LLM accelerates the latest large language models for generative AI, delivering up to 8X more performance, 5.3X better TCO, and nearly 6X lower energy consumption.

    8X Increase in GPT-J 6B Inference Performance

    TensorRT-LLM on H100 delivers an 8X increase in GPT-J 6B inference performance.

    4X Higher Llama 2 Inference Performance

    TensorRT-LLM on H100 delivers 4X higher Llama 2 inference performance.

    Total Cost of Ownership

    Lower is better
    TensorRT-LLM lowers total cost of ownership for GPT-J 6B and Llama 2 70B.

    Energy Use

    Lower is better
    TensorRT-LLM lowers energy use for GPT-J 6B and Llama 2 70B.

    Accelerate Every Inference Platform

    TensorRT can optimize AI deep learning models for applications across the edge, laptops and desktops, and data centers. It powers key NVIDIA solutions, such as NVIDIA TAO, NVIDIA DRIVE, NVIDIA Clara, and NVIDIA JetPack.

    TensorRT is also integrated with application-specific SDKs, such as NVIDIA NIM, NVIDIA DeepStream, NVIDIA Riva, NVIDIA Merlin, NVIDIA Maxine, NVIDIA Morpheus, and NVIDIA Broadcast Engine. TensorRT provides developers a unified path to deploy intelligent video analytics, speech AI, recommender systems, video conferencing, AI-based cybersecurity, and streaming apps in production.

    From creator apps to games and productivity tools, TensorRT is embraced by millions of NVIDIA RTX, GeForce, and Quadro GPU users. Whether integrated directly or via the ONNX-Runtime framework, TensorRT-optimized engines are weightless and compressed, empowering developers to incorporate AI-rich features without bloating app sizes.

    Read Success Stories

    Amazon

    Discover how Amazon improved customer satisfaction by accelerating its inference by 5X.

    American Express

    American Express improves fraud detection by analyzing tens of millions of daily transactions 50X faster. Find out how.

    Zoox

    Explore how Zoox, a robotaxi startup, accelerated their perception stack by 19X using TensorRT for real-time inference on autonomous vehicles.


    Widely Adopted Across Industries

    NVIDIA TensorRT is widely adopted by top companies across industries

    TensorRT Resources

    Read the Introductory TensorRT Blog

    Learn how to apply TensorRT optimizations and deploy a PyTorch model to GPUs.

    Watch On-Demand TensorRT Sessions From GTC

    Learn more about TensorRT and its features from a curated list of webinars at GTC.

    Get the Introductory Developer Guide

    See how to get started with TensorRT in this step-by-step developer and API reference guide.

    Use the right inference tools to develop AI for any application on any platform.

