
NVIDIA Triton

NVIDIA Triton™, open-source inference serving software, standardizes AI model deployment and execution and delivers fast and scalable AI in production. Triton is part of NVIDIA AI Enterprise, an NVIDIA software platform that accelerates the data science pipeline and streamlines the development and deployment of production AI.


Talk to an Expert        Free Trial        Download



How to Put AI Models into Production

NVIDIA Triton, also known as NVIDIA Triton Inference Server, streamlines and standardizes AI inference by enabling teams to deploy, run, and scale trained ML or DL models from any framework on any GPU- or CPU-based infrastructure. It provides AI researchers and data scientists the freedom to choose the right framework for their projects without impacting production deployment. It also helps developers deliver high-performance inference across cloud, on-prem, edge, and embedded devices.


Explore the Benefits of NVIDIA Triton


Support for Multiple Frameworks

Triton supports all major training and inference frameworks, such as TensorFlow, NVIDIA® TensorRT™, PyTorch, Python, ONNX, RAPIDS™ cuML, XGBoost, scikit-learn RandomForest, OpenVINO, custom C++, and more.


High-Performance AI Inference

Triton supports all NVIDIA GPU-, x86-, Arm® CPU-, and AWS Inferentia-based inferencing. It offers dynamic batching, concurrent execution, optimal model configuration, model ensemble, and streaming audio/video inputs to maximize throughput and utilization.


Designed for DevOps and MLOps

Triton integrates with Kubernetes for orchestration and scaling, exports Prometheus metrics for monitoring, supports live model updates, and can be used in all major public cloud AI and Kubernetes platforms. It’s also integrated into many MLOps software solutions.


An Integral Part of NVIDIA AI

The NVIDIA AI platform, which includes Triton, gives enterprises the compute power, tools, and algorithms they need to succeed in AI, accelerating workloads from speech recognition and recommender systems to medical imaging and improved logistics.


Supports Model Ensembles

Because most modern inference requires multiple models with preprocessing and post-processing to be executed for a single query, Triton supports model ensembles and pipelines. Triton can execute the parts of the ensemble on CPU or GPU and allows multiple frameworks inside the ensemble.


Enterprise-Grade Security and API Stability

NVIDIA AI Enterprise includes NVIDIA Triton for production inference, accelerating enterprises to the leading edge of AI with enterprise support, security, and API stability while mitigating the potential risks of open-source software.


Fast and Scalable AI in Every Application

Achieve High-Throughput Inference

Triton executes multiple models from the same or different frameworks concurrently on a single GPU or CPU. In a multi-GPU server, Triton automatically creates an instance of each model on each GPU to increase utilization.


It also optimizes serving for real-time inferencing under strict latency constraints with dynamic batching, supports batch inferencing to maximize GPU and CPU utilization, and includes built-in support for audio and video streaming input. Triton supports model ensemble for use cases that require a pipeline of multiple models with pre- and postprocessing to perform end-to-end inference, such as conversational AI.


Models can be updated live in production without restarting Triton or the application. Triton enables multi-GPU, multi-node inference on very large models that cannot fit in a single GPU’s memory.
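
The snippet below is a minimal, unofficial sketch of how a Python client might issue many requests in parallel so Triton's dynamic batcher can group them on the server side. The model name "my_model" and tensor names "INPUT0"/"OUTPUT0" are placeholders for your own model's configuration.

```python
# Hedged sketch (not official NVIDIA sample code): send concurrent requests so
# Triton can apply dynamic batching. Assumes a Triton server at localhost:8000.
import numpy as np
import tritonclient.http as httpclient

# 'concurrency' controls how many HTTP connections the client keeps open.
client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=8)

def make_request(batch):
    inp = httpclient.InferInput("INPUT0", list(batch.shape), "FP32")
    inp.set_data_from_numpy(batch)
    out = httpclient.InferRequestedOutput("OUTPUT0")
    # async_infer returns a handle immediately; Triton is free to batch
    # these in-flight requests together on the server.
    return client.async_infer("my_model", inputs=[inp], outputs=[out])

handles = [make_request(np.random.rand(1, 3, 224, 224).astype(np.float32))
           for _ in range(32)]
results = [h.get_result().as_numpy("OUTPUT0") for h in handles]
print(len(results), "responses received")
```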


Scale Inference With Ease

Available as a Docker container, Triton integrates with Kubernetes for orchestration, metrics, and autoscaling. Triton also integrates with Kubeflow and KServe for an end-to-end AI workflow and exports Prometheus metrics for monitoring GPU utilization, latency, memory usage, and inference throughput. It supports the standard HTTP/gRPC interface to connect with other applications like load balancers and can easily scale to any number of servers to handle increasing inference loads for any model.


Triton can serve tens or hundreds of models. Models can be loaded into and unloaded out of the inference server based on changes in demand to fit in GPU or CPU memory. Support for heterogeneous clusters with both GPUs and CPUs helps standardize inference across platforms, and Triton dynamically scales out to any CPU or GPU to handle peak loads.
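
As an illustration, the following hedged sketch uses Triton's Python HTTP client to load and unload a model on demand. It assumes the server was started with --model-control-mode=explicit, and the model name "my_model" is a placeholder.

```python
# Hedged sketch: on-demand model load/unload over Triton's HTTP API.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Inspect which models the repository currently holds.
for entry in client.get_model_repository_index():
    print(entry)

client.load_model("my_model")             # bring the model into GPU/CPU memory
print(client.is_model_ready("my_model"))  # True once loading completes

client.unload_model("my_model")           # free memory when demand drops
```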


Take a Closer Look at Triton Functionality


Native Support in Python

PyTriton provides a simple interface that lets Python developers use Triton Inference Server to serve anything, be it a model, a simple processing function, or an entire inference pipeline.

This native support for Triton in Python enables rapid prototyping and testing of machine learning models with performance and efficiency, for example, high hardware utilization. A single line of code brings up Triton Inference Server, providing benefits such as dynamic batching, concurrent model execution, and support for GPU and CPU from within the Python code. This eliminates the need to set up model repositories and convert model formats. Existing inference pipeline code can be used without modification.
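
For example, a minimal sketch following PyTriton's documented bind-and-serve pattern might look like the following; the model name "Doubler" and the tensor names are illustrative placeholders.

```python
# Minimal PyTriton sketch: bind a plain Python function to Triton and serve it.
import numpy as np
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

@batch
def infer_fn(input_1):
    # Any Python/NumPy (or framework) code can run here.
    return {"output_1": input_1 * 2.0}

with Triton() as triton:
    triton.bind(
        model_name="Doubler",
        infer_func=infer_fn,
        inputs=[Tensor(name="input_1", dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(name="output_1", dtype=np.float32, shape=(-1,))],
        config=ModelConfig(max_batch_size=128),  # enables batching up to 128
    )
    triton.serve()  # blocks and serves HTTP/gRPC endpoints
```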


Learn More About PyTriton

Model Orchestration With Management Service

Triton brings new model orchestration functionality for efficient multi-model inference. This functionality, which runs as a production service, loads models on demand and unloads models when not in use. It efficiently allocates GPU resources by placing as many models as possible on a single GPU server and helps to group models from different frameworks for efficient memory use. The model orchestration feature is in private early access (EA).


Sign up for EA

Large Language Model Inference

Models are rapidly growing in size, especially in natural language processing; for example, GPT-3 has 175 billion parameters, and Megatron 530B is larger still. GPUs are naturally the right compute resource for these large models, but the models are so large that they cannot fit on a single GPU. Triton can partition the model into multiple smaller files and execute each on a separate GPU within or across servers. The FasterTransformer backend in Triton, which enables this multi-GPU, multi-node inference, provides optimized and scalable inference for GPT-family, T5, OPT, and UL2 models today.


Learn More in the Blog

Optimal Model Configuration With Model Analyzer

Triton’s Model Analyzer is a tool that automatically evaluates model deployment configurations in Triton Inference Server, such as batch size, precision, and concurrent execution instances on the target processor. It helps select the optimal configuration to meet application quality-of-service (QoS) constraints—latency, throughput, and memory requirements—and reduces the time needed to find the optimal configuration from weeks to hours. This tool also supports model ensembles and multi-model analysis.
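
Model Analyzer is typically run from the command line; as a rough illustration, the hedged sketch below drives it from Python via subprocess. The repository path, output path, and model name "my_model" are placeholders, and it assumes the model-analyzer package is installed.

```python
# Hedged sketch: sweep deployment configurations for one model with Model Analyzer.
import subprocess

subprocess.run(
    [
        "model-analyzer", "profile",
        "--model-repository", "/models",             # your Triton model repository
        "--profile-models", "my_model",              # model(s) to sweep configurations for
        "--output-model-repository-path", "/tmp/output_models",
    ],
    check=True,
)
```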


Learn More at GitHub
Figure: Example output from model analyzer tool

Tree-Based Model Inference With the Forest Inference Library (FIL) Backend

The new Forest Inference Library (FIL) backend in Triton provides support for high-performance inference of tree-based models with explainability (SHAP values) on CPUs and GPUs. It supports models from XGBoost, LightGBM, scikit-learn RandomForest, RAPIDS™ cuML RandomForest, and others in Treelite format.
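
As a rough illustration, a tree-based model can be exported into a Triton model repository for the FIL backend along these lines. The repository path and model name "fraud_detector" are placeholders, and the accompanying config.pbtxt (not shown) would select the FIL backend with an XGBoost JSON model type.

```python
# Hedged sketch: train an XGBoost model and place it in a Triton model repository
# layout of the form <repo>/<model_name>/<version>/ for the FIL backend.
import os
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10_000, n_features=32, random_state=0)
clf = xgb.XGBClassifier(n_estimators=200, max_depth=6)
clf.fit(X, y)

version_dir = "model_repository/fraud_detector/1"
os.makedirs(version_dir, exist_ok=True)
# Save in XGBoost's JSON format; the filename matches FIL's xgboost_json model type.
clf.save_model(os.path.join(version_dir, "xgboost.json"))
```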


Learn More

See Triton’s Ecosystem Integrations

AI is driving innovation across businesses of every size and scale, and NVIDIA AI is at the forefront of this innovation. An open-source software solution, Triton is the top choice for AI inference and model deployment. Triton is available in Alibaba Cloud, Amazon Elastic Kubernetes Service (EKS), Amazon Elastic Container Service (ECS), Amazon SageMaker, Google Kubernetes Engine (GKE), Google Vertex AI, HPE Ezmeral, Microsoft Azure Kubernetes Service (AKS), Azure Machine Learning, and Oracle Cloud Infrastructure Data Science Platform.

Read Success Stories

Amazon

Discover how Amazon improved customer satisfaction with NVIDIA AI by accelerating its inference 5X.

Learn more
American Express

Learn how American Express improved fraud detection by analyzing tens of millions of daily transactions 50X faster.

Learn more
Siemens Energy

Discover how Siemens Energy augmented physical inspections by providing AI-based remote monitoring for leaks, abnormal noises, and more.

Learn more

Discover More Resources


Simplify and Standardize AI Deployment at Scale

Learn how Triton meets the challenges of deploying AI models at scale and download this step-by-step guide on streamlining AI inference with Triton.

Read whitepaper

Watch Inference and Triton GTC Sessions On-Demand

Check out the latest on-demand sessions from NVIDIA GTC on how to put AI into production with Triton.

Watch Videos

AI Inference Blogs

Read the latest news and blogs on NVIDIA Triton, and learn how to streamline your AI inference deployment.

Read blogs

Join NVIDIA’s inference community and stay current on the latest feature updates, bug fixes, and more for Triton.


Register Now