Development & Optimization

Jul 18, 2025

Optimizing for Low-Latency Communication in Inference Workloads with JAX and XLA

Running inference with large language models (LLMs) in production requires meeting stringent latency constraints. A critical stage in the process is LLM decode,...

6 MIN READ

Jul 16, 2025

CUTLASS 3.x: Orthogonal, Reusable, and Composable Abstractions for GEMM Kernel Design

GEMM optimization on GPUs is a modular problem. Performant implementations need to specify hyperparameters such as tile shapes, math and copy instructions, and...

12 MIN READ

Jul 16, 2025

CUTLASS: Principled Abstractions for Handling Multidimensional Data Through Tensors and Spatial Microkernels

In the era of generative AI, utilizing GPUs to their maximum potential is essential to training better models and serving users at scale. Often, these models...

12 MIN READ

Jul 15, 2025

NVIDIA Dynamo Adds Support for AWS Services to Deliver Cost-Efficient Inference at Scale

Amazon Web Services (AWS) developers and solution architects can now take advantage of NVIDIA Dynamo on NVIDIA GPU-based Amazon EC2, including Amazon EC2 P6...

4 MIN READ

Jul 09, 2025

Reinforcement Learning with NVIDIA NeMo-RL: Reproducing a DeepScaleR Recipe Using GRPO

Reinforcement learning (RL) is the backbone of interactive AI. It is fundamental for teaching agents to reason and learn from human preferences, enabling...

5 MIN READ

Jul 09, 2025

Delivering the Missing Building Blocks for NVIDIA CUDA Kernel Fusion in Python

C++ libraries like CUB and Thrust provide high-level building blocks that enable NVIDIA CUDA application and library developers to write speed-of-light code...

5 MIN READ

Jul 07, 2025

Think Smart and Ask an Encyclopedia-Sized Question: Multi-Million Token Real-Time Inference for 32X More Users

Modern AI applications increasingly rely on models that combine huge parameter counts with multi-million-token context windows. Whether it is AI agents...

8 MIN READ

Jul 03, 2025

New Video: Build Self-Improving AI Agents with the NVIDIA Data Flywheel Blueprint

AI agents powered by large language models are transforming enterprise workflows, but high inference costs and latency can limit their scalability and user...

2 MIN READ

Jul 02, 2025

Advanced NVIDIA CUDA Kernel Optimization Techniques: Handwritten PTX

As accelerated computing continues to drive application performance in all areas of AI and scientific computing, there's a renewed interest in GPU optimization...

11 MIN READ

Jun 25, 2025

How to Streamline Complex LLM Workflows Using NVIDIA NeMo-Skills

A typical recipe for improving LLMs involves multiple stages: synthetic data generation (SDG), model training through supervised fine-tuning (SFT) or...

10 MIN READ

Jun 18, 2025

Improved Performance and Monitoring Capabilities with NVIDIA Collective Communications Library 2.26

The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multinode communication primitives optimized for NVIDIA GPUs and networking. NCCL...

11 MIN READ

Jun 18, 2025

Compiler Explorer: An Essential Kernel Playground for CUDA Developers

Have you ever wondered exactly what the CUDA compiler generates when you write GPU kernels? Ever wanted to share a minimal CUDA example with a colleague...

7 MIN READ

Jun 13, 2025

Run High-Performance LLM Inference Kernels from NVIDIA Using FlashInfer??

Best-in-class LLM Inference requires two key elements: speed and developer velocity. Speed refers to maximizing the efficiency of the underlying hardware by...

6 MIN READ

Jun 12, 2025

Accelerated Sequence Alignment for Protein Science with MMseqs2-GPU and NVIDIA NIM

Protein sequence alignment—comparing protein sequences for similarities—is fundamental to modern biology and medicine. It illuminates gene functions by...

9 MIN READ

Jun 11, 2025

Accelerate Decision Optimization Using Open Source NVIDIA cuOpt

Businesses make thousands of decisions every day—what to produce, where to ship, how to allocate resources. At scale, optimizing these decisions becomes a...

5 MIN READ

Jun 11, 2025

Introducing NVIDIA DGX Cloud Lepton: A Unified AI Platform Built for Developers

The age of AI-native applications has arrived. Developers are building advanced agentic and physical AI systems—but scaling across geographies and GPU...

6 MIN READ