Inference Performance – NVIDIA Technical Blog

Inference Performance – NVIDIA Technical Blog News and tutorials for developers, data scientists, and IT admins 2025-05-16T23:50:38Z http://www.open-lab.net/blog/feed/ Vinh Nguyen <![CDATA[LLM Inference Benchmarking Guide: NVIDIA GenAI-Perf and NIM]]> http://www.open-lab.net/blog/?p=99180 2025-05-15T19:07:45Z 2025-05-06T17:35:39Z

This is the second post in the LLM Benchmarking series, which shows how to use GenAI-Perf to benchmark the Meta Llama 3 model when deployed with NVIDIA NIM.?...]]>

This is the second post in the LLM Benchmarking series, which shows how to use GenAI-Perf to benchmark the Meta Llama 3 model when deployed with NVIDIA NIM.?... Decorative image of a datacenter with floating icons overlaid.

Decorative image of a datacenter with floating icons overlaid.

This is the second post in the LLM Benchmarking series, which shows how to use GenAI-Perf to benchmark the Meta Llama 3 model when deployed with NVIDIA NIM. When building LLM-based applications, it is critical to understand the performance characteristics of these models on a given hardware. This serves multiple purposes: As a client-side LLM-focused benchmarking tool��

]]> 0 Maximilian M��ller <![CDATA[Optimizing Transformer-Based Diffusion Models for Video Generation with NVIDIA TensorRT]]> http://www.open-lab.net/blog/?p=98927 2025-05-15T19:08:48Z 2025-04-21T18:44:38Z

State-of-the-art image diffusion models take tens of seconds to process a single image. This makes video diffusion even more challenging, requiring significant...]]>

State-of-the-art image diffusion models take tens of seconds to process a single image. This makes video diffusion even more challenging, requiring significant...

painting-woman-hair-giant-flowers

State-of-the-art image diffusion models take tens of seconds to process a single image. This makes video diffusion even more challenging, requiring significant computational resources and high costs. By leveraging the latest FP8 quantization features on NVIDIA Hopper GPUs with NVIDIA TensorRT, it��s possible to significantly reduce inference costs and serve more users with fewer GPUs.

]]> 0 Ashraf Eassa <![CDATA[NVIDIA Blackwell Delivers Massive Performance Leaps in MLPerf Inference v5.0]]> http://www.open-lab.net/blog/?p=98367 2025-04-23T19:41:12Z 2025-04-02T18:14:48Z

The compute demands for large language model (LLM) inference are growing rapidly, fueled by the combination of growing model sizes, real-time latency...]]>

The compute demands for large language model (LLM) inference are growing rapidly, fueled by the combination of growing model sizes, real-time latency...

nvidia-blackwell

The compute demands for large language model (LLM) inference are growing rapidly, fueled by the combination of growing model sizes, real-time latency requirements, and, most recently, AI reasoning. At the same time, as AI adoption grows, the ability of an AI factory to serve as many users as possible, all while maintaining good per-user experiences, is key to maximizing the value it generates.

]]> 0 Vinh Nguyen <![CDATA[LLM Inference Benchmarking: Fundamental Concepts]]> http://www.open-lab.net/blog/?p=98215 2025-05-09T18:23:04Z 2025-04-02T17:00:00Z

This is the first post in the large language model latency-throughput benchmarking series, which aims to instruct developers on common metrics used for LLM...]]>

This is the first post in the large language model latency-throughput benchmarking series, which aims to instruct developers on common metrics used for LLM...

data-center

This is the first post in the large language model latency-throughput benchmarking series, which aims to instruct developers on common metrics used for LLM benchmarking, fundamental concepts, and how to benchmark your LLM applications. The past few years have witnessed the rise in popularity of generative AI and large language models (LLMs), as part of a broad AI revolution.

]]> 0 Uttara Kumar <![CDATA[Boost Llama Model Performance on Microsoft Azure AI Foundry with NVIDIA TensorRT-LLM]]> http://www.open-lab.net/blog/?p=97008 2025-04-23T00:07:01Z 2025-03-20T15:00:00Z

Microsoft, in collaboration with NVIDIA, announced transformative performance improvements for the Meta Llama family of models on its Azure AI Foundry platform....]]>

Microsoft, in collaboration with NVIDIA, announced transformative performance improvements for the Meta Llama family of models on its Azure AI Foundry platform....

llm-graphic

Microsoft, in collaboration with NVIDIA, announced transformative performance improvements for the Meta Llama family of models on its Azure AI Foundry platform. These advancements, enabled by NVIDIA TensorRT-LLM optimizations, deliver significant gains in throughput, reduced latency, and improved cost efficiency, all while preserving the quality of model outputs. With these improvements��

]]> 0 Amr Elmeleegy <![CDATA[Introducing NVIDIA Dynamo, A Low-Latency Distributed Inference Framework for Scaling Reasoning AI Models]]> http://www.open-lab.net/blog/?p=95274 2025-04-23T00:15:55Z 2025-03-18T17:50:00Z

NVIDIA announced the release of NVIDIA Dynamo today at GTC 2025. NVIDIA Dynamo is a high-throughput, low-latency open-source inference serving framework for...]]>

NVIDIA announced the release of NVIDIA Dynamo today at GTC 2025. NVIDIA Dynamo is a high-throughput, low-latency open-source inference serving framework for...

computer-monitor-data-center-abstract

NVIDIA announced the release of NVIDIA Dynamo today at GTC 2025. NVIDIA Dynamo is a high-throughput, low-latency open-source inference serving framework for deploying generative AI and reasoning models in large-scale distributed environments. The framework boosts the number of requests served by up to 30x, when running the open-source DeepSeek-R1 models on NVIDIA Blackwell.

]]> 1 Ashraf Eassa <![CDATA[NVIDIA Blackwell Delivers World-Record DeepSeek-R1 Inference Performance]]> http://www.open-lab.net/blog/?p=97352 2025-04-23T00:23:25Z 2025-03-18T17:41:42Z

NVIDIA announced world-record DeepSeek-R1 inference performance at NVIDIA GTC 2025. A single NVIDIA DGX system with eight NVIDIA Blackwell GPUs can achieve over...]]>

NVIDIA announced world-record DeepSeek-R1 inference performance at NVIDIA GTC 2025. A single NVIDIA DGX system with eight NVIDIA Blackwell GPUs can achieve over...

inference-world-record-nvidia-blackwell

NVIDIA announced world-record DeepSeek-R1 inference performance at NVIDIA GTC 2025. A single NVIDIA DGX system with eight NVIDIA Blackwell GPUs can achieve over 250 tokens per second per user or a maximum throughput of over 30,000 tokens per second on the massive, state-of-the-art 671 billion parameter DeepSeek-R1 model. These rapid advancements in performance at both ends of the performance��

]]> 1 Anjali Shah <![CDATA[Optimizing Qwen2.5-Coder Throughput with NVIDIA TensorRT-LLM Lookahead Decoding]]> http://www.open-lab.net/blog/?p=96010 2025-04-23T02:44:36Z 2025-02-14T18:19:37Z

Large language models (LLMs) that specialize in coding have been steadily adopted into developer workflows. From pair programming to self-improving AI agents,...]]>

Large language models (LLMs) that specialize in coding have been steadily adopted into developer workflows. From pair programming to self-improving AI agents,...

computer-screen-abstract

Large language models (LLMs) that specialize in coding have been steadily adopted into developer workflows. From pair programming to self-improving AI agents, these models assist developers with various tasks, including enhancing code, fixing bugs, generating tests, and writing documentation. To promote the development of open-source LLMs, the Qwen team recently released Qwen2.5-Coder��

]]> 1 Nick Comly <![CDATA[Optimize AI Inference Performance with NVIDIA Full-Stack Solutions]]> http://www.open-lab.net/blog/?p=95310 2025-04-23T15:02:06Z 2025-01-24T16:00:00Z

The explosion of AI-driven applications has placed unprecedented demands on both developers, who must balance delivering cutting-edge performance with managing...]]>

The explosion of AI-driven applications has placed unprecedented demands on both developers, who must balance delivering cutting-edge performance with managing...

ai-model-composite-graphic.

As of 3/18/25, NVIDIA Triton Inference Server is now NVIDIA Dynamo. The explosion of AI-driven applications has placed unprecedented demands on both developers, who must balance delivering cutting-edge performance with managing operational complexity and cost, and AI infrastructure. NVIDIA is empowering developers with full-stack innovations��spanning chips, systems��

]]> 0 Rakib Hasan <![CDATA[NVIDIA TensorRT-LLM Now Supports Recurrent Drafting for Optimizing LLM Inference]]> http://www.open-lab.net/blog/?p=92963 2025-03-11T01:44:00Z 2024-12-18T17:31:01Z

Recurrent drafting (referred to as ReDrafter) is a novel speculative decoding technique developed and open-sourced by Apple for large language model (LLM)...]]>

Recurrent drafting (referred to as ReDrafter) is a novel speculative decoding technique developed and open-sourced by Apple for large language model (LLM)...

tensor-rt-llm-graphic

Recurrent drafting (referred to as ReDrafter) is a novel speculative decoding technique developed and open-sourced by Apple for large language model (LLM) inference now available with NVIDIA TensorRT-LLM. ReDrafter helps developers significantly boost LLM workload performance on NVIDIA GPUs. NVIDIA TensorRT-LLM is a library for optimizing LLM inference. It provides an easy-to-use Python API to��

]]> 0 Anjali Shah <![CDATA[Boost Llama 3.3 70B Inference Throughput 3x with NVIDIA TensorRT-LLM Speculative Decoding]]> http://www.open-lab.net/blog/?p=94146 2024-12-19T23:03:40Z 2024-12-17T17:00:00Z

Meta's Llama collection of open large language models (LLMs) continues to grow with the recent addition of Llama 3.3 70B, a text-only...]]>

Meta's Llama collection of open large language models (LLMs) continues to grow with the recent addition of Llama 3.3 70B, a text-only...

three-llamas-wearing-goggles

Meta��s Llama collection of open large language models (LLMs) continues to grow with the recent addition of Llama 3.3 70B, a text-only instruction-tuned model. Llama 3.3 provides enhanced performance respective to the older Llama 3.1 70B model and can even match the capabilities of the larger, more computationally expensive Llama 3.1 405B model on several tasks including math, reasoning, coding��

]]> 2 Amr Elmeleegy <![CDATA[Spotlight: Perplexity AI Serves 400 Million Search Queries a Month Using NVIDIA Inference Stack]]> http://www.open-lab.net/blog/?p=93396 2025-03-18T18:26:38Z 2024-12-05T17:58:43Z

The demand for AI-enabled services continues to grow rapidly, placing increasing pressure on IT and infrastructure teams. These teams are tasked with...]]>

The demand for AI-enabled services continues to grow rapidly, placing increasing pressure on IT and infrastructure teams. These teams are tasked with...

inference-perplexity-ai

As of 3/18/25, NVIDIA Triton Inference Server is now NVIDIA Dynamo. The demand for AI-enabled services continues to grow rapidly, placing increasing pressure on IT and infrastructure teams. These teams are tasked with provisioning the necessary hardware and software to meet that demand while simultaneously balancing cost efficiency with optimal user experience. This challenge was faced by the��

]]> 0 Carl (Izzy) Putterman <![CDATA[TensorRT-LLM Speculative Decoding Boosts Inference Throughput by up to 3.6x]]> http://www.open-lab.net/blog/?p=92847 2025-01-11T17:32:51Z 2024-12-02T23:09:43Z

NVIDIA TensorRT-LLM support for speculative decoding now provides over 3x the speedup in total token throughput. TensorRT-LLM is an open-source library that...]]>

NVIDIA TensorRT-LLM support for speculative decoding now provides over 3x the speedup in total token throughput. TensorRT-LLM is an open-source library that... Image of the TensorRT-LLM icon next to multiple other icons of computer activities.

Image of the TensorRT-LLM icon next to multiple other icons of computer activities.

NVIDIA TensorRT-LLM support for speculative decoding now provides over 3x the speedup in total token throughput. TensorRT-LLM is an open-source library that provides blazing-fast inference support for numerous popular large language models (LLMs) on NVIDIA GPUs. By adding support for speculative decoding on single GPU and single-node multi-GPU, the library further expands its supported��

]]> 3 Amr Elmeleegy <![CDATA[NVIDIA TensorRT-LLM Multiblock Attention Boosts Throughput by More Than 3x for Long Sequence Lengths on NVIDIA HGX H200]]> http://www.open-lab.net/blog/?p=92591 2024-12-12T19:47:20Z 2024-11-22T00:53:18Z

Generative AI models are advancing rapidly. Every generation of models comes with a larger number of parameters and longer context windows. The Llama 2 series...]]>

Generative AI models are advancing rapidly. Every generation of models comes with a larger number of parameters and longer context windows. The Llama 2 series... Image of an HGX H200

Image of an HGX H200

Generative AI models are advancing rapidly. Every generation of models comes with a larger number of parameters and longer context windows. The Llama 2 series of models introduced in July 2023 had a context length of 4K tokens, and the Llama 3.1 models, introduced only a year later, dramatically expanded that to 128K tokens. While long context lengths allow models to perform cognitive tasks��

]]> 1 Ashraf Eassa <![CDATA[Llama 3.2 Full-Stack Optimizations Unlock High Performance on NVIDIA GPUs]]> http://www.open-lab.net/blog/?p=90142 2024-11-22T23:11:53Z 2024-11-19T16:00:00Z

Meta recently released its Llama 3.2 series of vision language models (VLMs), which come in 11B parameter and 90B parameter variants. These models are...]]>

Meta recently released its Llama 3.2 series of vision language models (VLMs), which come in 11B parameter and 90B parameter variants. These models are...

three-llamas-holding-number-10-signs

Meta recently released its Llama 3.2 series of vision language models (VLMs), which come in 11B parameter and 90B parameter variants. These models are multimodal, supporting both text and image inputs. In addition, Meta has launched text-only small language model (SLM) variants of Llama 3.2 with 1B and 3B parameters. NVIDIA has optimized the Llama 3.2 collection of models for great performance and��

]]> 0 Amr Elmeleegy <![CDATA[Streamlining AI Inference Performance and Deployment with NVIDIA TensorRT-LLM Chunked Prefill]]> http://www.open-lab.net/blog/?p=92052 2024-11-15T17:59:38Z 2024-11-15T17:59:35Z

In this blog post, we take a closer look at chunked prefill, a feature of NVIDIA TensorRT-LLM that increases GPU utilization and simplifies the deployment...]]>

In this blog post, we take a closer look at chunked prefill, a feature of NVIDIA TensorRT-LLM that increases GPU utilization and simplifies the deployment...

In this blog post, we take a closer look at chunked prefill, a feature of NVIDIA TensorRT-LLM that increases GPU utilization and simplifies the deployment experience for developers. This builds on our previous post discussing how advanced KV cache optimization features in TensorRT-LLM improve performance up to 5x in use cases that require system prefills. When a user submits a request to��

]]> 0 Amr Elmeleegy <![CDATA[5x Faster Time to First Token with NVIDIA TensorRT-LLM KV Cache Early Reuse]]> http://www.open-lab.net/blog/?p=91625 2025-05-01T18:34:40Z 2024-11-08T23:55:43Z

In our previous blog post, we demonstrated how reusing the key-value (KV) cache by offloading it to CPU memory can accelerate time to first token (TTFT) by up...]]>

In our previous blog post, we demonstrated how reusing the key-value (KV) cache by offloading it to CPU memory can accelerate time to first token (TTFT) by up... NVIDIA H100.

NVIDIA H100.

In our previous blog post, we demonstrated how reusing the key-value (KV) cache by offloading it to CPU memory can accelerate time to first token (TTFT) by up to 14x on x86-based NVIDIA H100 Tensor Core GPUs and 28x on the NVIDIA GH200 Superchip. In this post, we shed light on KV cache reuse techniques and best practices that can drive even further TTFT speedups. LLM models are rapidly��

]]> 0 Anton Korzh <![CDATA[3x Faster AllReduce with NVSwitch and TensorRT-LLM MultiShot]]> http://www.open-lab.net/blog/?p=91412 2025-05-01T18:34:34Z 2024-11-01T22:00:36Z

Deploying generative AI workloads in production environments where user numbers can fluctuate from hundreds to hundreds of thousands �C and where input...]]>

Deploying generative AI workloads in production environments where user numbers can fluctuate from hundreds to hundreds of thousands �C and where input... Image of an HGX H200

Image of an HGX H200

Deploying generative AI workloads in production environments where user numbers can fluctuate from hundreds to hundreds of thousands �C and where input sequence lengths differ with each request �C poses unique challenges. To achieve low latency inference in these environments, multi-GPU setups are a must �C irrespective of the GPU generation or its memory capacity. To enhance inference performance in��

]]> 1 Amr Elmeleegy <![CDATA[NVIDIA GH200 Superchip Accelerates Inference by 2x in Multiturn Interactions with Llama Models]]> http://www.open-lab.net/blog/?p=90897 2024-11-06T02:24:56Z 2024-10-28T15:00:00Z

Deploying large language models (LLMs) in production environments often requires making hard trade-offs between enhancing user interactivity and increasing...]]>

Deploying large language models (LLMs) in production environments often requires making hard trade-offs between enhancing user interactivity and increasing...

grace-superchip

Deploying large language models (LLMs) in production environments often requires making hard trade-offs between enhancing user interactivity and increasing system throughput. While enhancing user interactivity requires minimizing time to first token (TTFT), increasing throughput requires increasing tokens per second. Improving one aspect often results in the decline of the other��

]]> 1 Ivan Goldwasser <![CDATA[NVIDIA Grace CPU Delivers World-Class Data Center Performance and Breakthrough Energy Efficiency]]> http://www.open-lab.net/blog/?p=90087 2024-11-06T02:26:22Z 2024-10-09T19:00:00Z

NVIDIA designed the NVIDIA Grace CPU to be a new kind of high-performance, data center CPU��one built to deliver breakthrough energy efficiency and optimized...]]>

NVIDIA designed the NVIDIA Grace CPU to be a new kind of high-performance, data center CPU��one built to deliver breakthrough energy efficiency and optimized...

nvidia-grace-cpu

NVIDIA designed the NVIDIA Grace CPU to be a new kind of high-performance, data center CPU��one built to deliver breakthrough energy efficiency and optimized for performance at data center scale. Accelerated computing is enabling giant leaps in performance and energy efficiency compared to traditional CPU computing. To deliver these speedups, full-stack innovation at data center scale is��

]]> 0 Nick Comly <![CDATA[Boosting Llama 3.1 405B Throughput by Another 1.5x on NVIDIA H200 Tensor Core GPUs and NVLink Switch]]> http://www.open-lab.net/blog/?p=90040 2024-11-22T23:12:12Z 2024-10-09T15:00:00Z

The continued growth of LLMs capability, fueled by increasing parameter counts and support for longer contexts, has led to their usage in a wide variety of...]]>

The continued growth of LLMs capability, fueled by increasing parameter counts and support for longer contexts, has led to their usage in a wide variety of...

HGX-H200-product-photo-close-up copy 2

The continued growth of LLMs capability, fueled by increasing parameter counts and support for longer contexts, has led to their usage in a wide variety of applications, each with diverse deployment requirements. For example, a chatbot supports a small number of users at very low latencies for good interactivity. Meanwhile, synthetic data generation requires high throughput to process many items��

]]> 1 Nick Comly <![CDATA[Low Latency Inference Chapter 2: Blackwell is Coming. NVIDIA GH200 NVL32 with NVLink Switch Gives Signs of Big Leap in Time to First Token Performance]]> http://www.open-lab.net/blog/?p=88938 2024-11-29T21:06:06Z 2024-09-26T21:44:00Z

Many of the most exciting applications of large language models (LLMs), such as interactive speech bots, coding co-pilots, and search, need to begin responding...]]>

Many of the most exciting applications of large language models (LLMs), such as interactive speech bots, coding co-pilots, and search, need to begin responding...

NVIDIA GH200

Many of the most exciting applications of large language models (LLMs), such as interactive speech bots, coding co-pilots, and search, need to begin responding to user queries quickly to deliver positive user experiences. The time that it takes for an LLM to ingest a user prompt (and context, which can be sizable) and begin outputting a response is called time to first token (TTFT).

]]> 0 Amr Elmeleegy <![CDATA[NVIDIA GH200 Grace Hopper Superchip Delivers Outstanding Performance in MLPerf Inference v4.1]]> http://www.open-lab.net/blog/?p=89401 2024-11-06T02:27:00Z 2024-09-24T16:36:57Z

In the latest round of MLPerf Inference �C a suite of standardized, peer-reviewed inference benchmarks �C the NVIDIA platform delivered outstanding...]]>

In the latest round of MLPerf Inference �C a suite of standardized, peer-reviewed inference benchmarks �C the NVIDIA platform delivered outstanding...

NVIDIA GH200 Grace Hopper Superchip Delivers Outstanding Performance in MLPerf Inference

In the latest round of MLPerf Inference �C a suite of standardized, peer-reviewed inference benchmarks �C the NVIDIA platform delivered outstanding performance across the board. Among the many submissions made using the NVIDIA platform were results using the NVIDIA GH200 Grace Hopper Superchip. GH200 tightly couples an NVIDIA Grace CPU with an NVIDIA Hopper GPU using NVIDIA NVLink-C2C��

]]> 0 Ashraf Eassa <![CDATA[Low Latency Inference Chapter 1: Up to 1.9x Higher Llama 3.1 Performance with Medusa on NVIDIA HGX H200 with NVLink Switch]]> http://www.open-lab.net/blog/?p=88127 2024-11-29T21:06:37Z 2024-09-05T18:30:00Z

As large language models (LLMs) continue to grow in size and complexity, multi-GPU compute is a must-have to deliver the low latency and high throughput that...]]>

As large language models (LLMs) continue to grow in size and complexity, multi-GPU compute is a must-have to deliver the low latency and high throughput that... Image of an HGX H200

Image of an HGX H200

As large language models (LLMs) continue to grow in size and complexity, multi-GPU compute is a must-have to deliver the low latency and high throughput that real-time generative AI applications demand. Performance depends both on the ability for the combined GPUs to process requests as ��one mighty GPU�� with ultra-fast GPU-to-GPU communication and advanced software able to take full��

]]> 0 Anjali Shah <![CDATA[Boosting Llama 3.1 405B Performance up to 1.44x with NVIDIA TensorRT Model Optimizer on NVIDIA H200 GPUs]]> http://www.open-lab.net/blog/?p=88017 2024-11-14T15:58:41Z 2024-08-28T19:30:00Z

The Llama 3.1 405B large language model (LLM), developed by Meta, is an open-source community model that delivers state-of-the-art performance and supports a...]]>

The Llama 3.1 405B large language model (LLM), developed by Meta, is an open-source community model that delivers state-of-the-art performance and supports a...

llama-sunglasses-meadow

The Llama 3.1 405B large language model (LLM), developed by Meta, is an open-source community model that delivers state-of-the-art performance and supports a variety of use cases. With 405 billion parameters and support for context lengths of up to 128K tokens, Llama 3.1 405B is also one of the most demanding LLMs to run. To deliver both low latency to optimize the user experience and high��

]]> 1 ��˳��97caoporen��