Inference Performance

Jul 01, 2025

Per-Tensor and Per-Block Scaling Strategies for Effective FP8 Training

In this blog post, we’ll break down the main FP8 scaling strategies—per-tensor scaling, delayed and current scaling, and per-block scaling (including the...

10 MIN READ

Jun 26, 2025

Run Google DeepMind’s Gemma 3n on NVIDIA Jetson and RTX

As of today, NVIDIA supports the general availability of Gemma 3n on NVIDIA RTX and Jetson. Gemma, previewed by Google DeepMind at Google I/O last month,...

4 MIN READ

Jun 24, 2025

Introducing NVFP4 for Efficient and Accurate Low-Precision Inference

To get the most out of AI, optimizations are critical. When developers think about optimizing AI models for inference, model compression techniques—such as...

11 MIN READ

Jun 13, 2025

Run High-Performance LLM Inference Kernels from NVIDIA Using FlashInfer

Best-in-class LLM inference requires two key elements: speed and developer velocity. Speed refers to maximizing the efficiency of the underlying hardware by...

6 MIN READ

Jun 12, 2025

Run High-Performance AI Applications with NVIDIA TensorRT for RTX

NVIDIA TensorRT for RTX is now available for download as an SDK that can be integrated into C++ and Python applications for both Windows and Linux. At...

7 MIN READ

Jun 06, 2025

How NVIDIA GB200 NVL72 and NVIDIA Dynamo Boost Inference Performance for MoE Models

The latest wave of open source large language models (LLMs), like DeepSeek R1, Llama 4, and Qwen3, has embraced Mixture of Experts (MoE) architectures. Unlike...

12 MIN READ

May 22, 2025

Blackwell Breaks the 1,000 TPS/User Barrier With Meta’s Llama 4 Maverick

NVIDIA has achieved a world-record large language model (LLM) inference speed. A single NVIDIA DGX B200 node with eight NVIDIA Blackwell GPUs can achieve over...

9 MIN READ

May 21, 2025

NVIDIA Dynamo Accelerates llm-d Community Initiatives for Advancing Large-Scale Distributed Inference

The introduction of the llm-d community at Red Hat Summit 2025 marks a significant step forward in accelerating generative AI inference innovation for the open...

5 MIN READ

May 06, 2025

LLM Inference Benchmarking Guide: NVIDIA GenAI-Perf and NIM

This is the second post in the LLM Benchmarking series, which shows how to use GenAI-Perf to benchmark the Meta Llama 3 model when deployed with NVIDIA NIM....

11 MIN READ

Apr 21, 2025

Optimizing Transformer-Based Diffusion Models for Video Generation with NVIDIA TensorRT

State-of-the-art image diffusion models take tens of seconds to process a single image. This makes video diffusion even more challenging, requiring significant...

8 MIN READ

Apr 02, 2025

NVIDIA Blackwell Delivers Massive Performance Leaps in MLPerf Inference v5.0

The compute demands for large language model (LLM) inference are growing rapidly, fueled by the combination of growing model sizes, real-time latency...

10 MIN READ

Apr 02, 2025

LLM Inference Benchmarking: Fundamental Concepts

This is the first post in the large language model latency-throughput benchmarking series, which aims to instruct developers on common metrics used for LLM...

15 MIN READ

Mar 20, 2025

Boost Llama Model Performance on Microsoft Azure AI Foundry with NVIDIA TensorRT-LLM

Microsoft, in collaboration with NVIDIA, announced transformative performance improvements for the Meta Llama family of models on its Azure AI Foundry platform....

4 MIN READ

Mar 18, 2025

Introducing NVIDIA Dynamo, A Low-Latency Distributed Inference Framework for Scaling Reasoning AI Models

NVIDIA announced the release of NVIDIA Dynamo today at GTC 2025. NVIDIA Dynamo is a high-throughput, low-latency open-source inference serving framework for...

14 MIN READ

Mar 18, 2025

NVIDIA Blackwell Delivers World-Record DeepSeek-R1 Inference Performance

NVIDIA announced world-record DeepSeek-R1 inference performance at NVIDIA GTC 2025. A single NVIDIA DGX system with eight NVIDIA Blackwell GPUs can achieve over...

14 MIN READ

Feb 14, 2025

Optimizing Qwen2.5-Coder Throughput with NVIDIA TensorRT-LLM Lookahead Decoding

Large language models (LLMs) that specialize in coding have been steadily adopted into developer workflows. From pair programming to self-improving AI agents,...

7 MIN READ