Best-in-class LLM inference requires two key elements: speed and developer velocity. Speed refers to maximizing the efficiency of the underlying hardware by using highly optimized compute kernels and algorithms. Developer velocity refers to the ability to quickly adopt these new kernels and accelerate new models, algorithms, and hardware. Ultimately, this velocity is underpinned by the quick…
As of March 18, 2025, NVIDIA Triton Inference Server is now part of the NVIDIA Dynamo Platform and has accordingly been renamed NVIDIA Dynamo Triton. The explosion of AI-driven applications has placed unprecedented demands both on AI infrastructure and on developers, who must balance delivering cutting-edge performance with managing operational complexity and cost.
Meta recently released its Llama 3.2 series of vision language models (VLMs), which come in 11B parameter and 90B parameter variants. These models are multimodal, supporting both text and image inputs. In addition, Meta has launched text-only small language model (SLM) variants of Llama 3.2 with 1B and 3B parameters. NVIDIA has optimized the Llama 3.2 collection of models for great performance and…
In this blog post, we take a closer look at chunked prefill, a feature of NVIDIA TensorRT-LLM that increases GPU utilization and simplifies the deployment experience for developers. This builds on our previous post discussing how advanced KV cache optimization features in TensorRT-LLM improve performance by up to 5x in use cases that require system prompts. When a user submits a request to…
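To make the scheduling idea concrete, below is a minimal Python sketch of chunked prefill, not TensorRT-LLM's actual implementation: the chunk size, `Request` structure, and scheduler are all hypothetical, but they show how splitting a long prompt into chunks lets prefill work interleave with ongoing decode steps.

```python
from collections import deque
from dataclasses import dataclass, field

CHUNK_TOKENS = 512  # hypothetical chunk size; real systems tune this

@dataclass
class Request:
    prompt_tokens: list
    chunks: deque = field(default_factory=deque)

def chunk_prompt(req: Request, chunk_size: int = CHUNK_TOKENS) -> None:
    """Split the prompt into fixed-size chunks for incremental prefill."""
    toks = req.prompt_tokens
    req.chunks = deque(toks[i:i + chunk_size]
                       for i in range(0, len(toks), chunk_size))

def schedule_step(prefill_queue: deque, decode_batch: list) -> list:
    """One illustrative scheduler iteration: run at most one prefill chunk
    alongside the whole decode batch, so a long prompt no longer
    monopolizes the GPU for an entire prefill pass."""
    work = [("decode", r) for r in decode_batch]
    if prefill_queue:
        req = prefill_queue[0]
        work.append(("prefill_chunk", req.chunks.popleft()))
        if not req.chunks:            # prompt fully processed into KV cache
            prefill_queue.popleft()   # request graduates to the decode phase
            decode_batch.append(req)
    return work
```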
In our previous blog post, we demonstrated how reusing the key-value (KV) cache by offloading it to CPU memory can accelerate time to first token (TTFT) by up to 14x on x86-based NVIDIA H100 Tensor Core GPUs and 28x on the NVIDIA GH200 Superchip. In this post, we shed light on KV cache reuse techniques and best practices that can drive even further TTFT speedups. LLMs are rapidly…
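As an illustration of the reuse mechanism (a toy model, not TensorRT-LLM's API), the sketch below hashes prompt prefixes at block granularity and promotes offloaded blocks from host memory back to the GPU on a hit; the names and block size are assumptions.

```python
import hashlib

BLOCK_TOKENS = 64  # hypothetical KV block granularity

gpu_cache = {}  # block_hash -> KV block resident on GPU
cpu_cache = {}  # block_hash -> KV block offloaded to host memory

def block_hashes(token_ids):
    """Hash each full prompt-prefix block; matching hashes mean the
    corresponding KV blocks can be reused instead of recomputed."""
    hashes, h = [], hashlib.sha256()
    full = len(token_ids) - len(token_ids) % BLOCK_TOKENS
    for i in range(0, full, BLOCK_TOKENS):
        h.update(str(token_ids[i:i + BLOCK_TOKENS]).encode("utf-8"))
        hashes.append(h.hexdigest())  # rolling hash ties a block to its prefix
    return hashes

def lookup(token_ids):
    """Count how many leading KV blocks can be served from cache,
    promoting CPU-offloaded blocks back to the GPU on a hit."""
    reused = 0
    for hh in block_hashes(token_ids):
        if hh in gpu_cache:
            reused += 1
        elif hh in cpu_cache:
            gpu_cache[hh] = cpu_cache[hh]  # onboard from host to GPU
            reused += 1
        else:
            break  # first miss: everything after it must be prefilled
    return reused * BLOCK_TOKENS  # tokens whose prefill is skipped
```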
Deploying generative AI workloads in production environments, where user numbers can fluctuate from hundreds to hundreds of thousands and where input sequence lengths differ with each request, poses unique challenges. To achieve low latency inference in these environments, multi-GPU setups are a must, irrespective of the GPU generation or its memory capacity. To enhance inference performance in…
Deploying large language models (LLMs) in production environments often requires making hard trade-offs between enhancing user interactivity and increasing system throughput. While enhancing user interactivity requires minimizing time to first token (TTFT), increasing throughput requires increasing tokens per second. Improving one aspect often results in the decline of the other…
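A toy cost model makes the tension visible; the constants below are assumed purely for illustration. Larger batches amortize the fixed per-step cost and raise aggregate tokens per second, but every sequence in the batch then waits longer for each of its own tokens.

```python
# Toy decode-step cost model with assumed constants; real profiles differ.
STEP_OVERHEAD_MS = 5.0  # fixed per-step cost (weight reads, kernel launch)
PER_SEQ_MS = 0.5        # incremental per-sequence cost within a step

for batch in (1, 8, 32, 128):
    step_ms = STEP_OVERHEAD_MS + PER_SEQ_MS * batch
    tokens_per_sec = batch * 1000.0 / step_ms  # system throughput rises...
    ms_per_user_token = step_ms                # ...but each user waits longer
    print(f"batch={batch:>4}  {tokens_per_sec:8.1f} tok/s  "
          f"{ms_per_user_token:5.1f} ms/token per user")
```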
The continued growth of LLM capabilities, fueled by increasing parameter counts and support for longer contexts, has led to their usage in a wide variety of applications, each with diverse deployment requirements. For example, a chatbot serves a small number of users at very low latency to ensure good interactivity. Meanwhile, synthetic data generation requires high throughput to process many items…
Many of the most exciting applications of large language models (LLMs), such as interactive speech bots, coding co-pilots, and search, need to begin responding to user queries quickly to deliver positive user experiences. The time that it takes for an LLM to ingest a user prompt (and context, which can be sizable) and begin outputting a response is called time to first token (TTFT).
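Measuring TTFT is straightforward once the server streams tokens. Below is a small, generic sketch in which `llm_client.generate_stream` is a hypothetical streaming API standing in for whatever client you use.

```python
import time

def measure_ttft(stream):
    """Time from request dispatch to the first generated token.
    `stream` is any iterator that yields tokens as they are produced."""
    start = time.perf_counter()
    first_token = next(stream)  # blocks through the entire prefill phase
    ttft = time.perf_counter() - start
    return first_token, ttft

# Usage with a hypothetical streaming client (assumed API, not a real SDK):
# tokens = llm_client.generate_stream(prompt)
# tok, ttft = measure_ttft(tokens)
# print(f"TTFT: {ttft * 1000:.0f} ms")
```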
As large language models (LLMs) continue to grow in size and complexity, multi-GPU compute is a must-have to deliver the low latency and high throughput that real-time generative AI applications demand. Performance depends on both the ability of the combined GPUs to process requests as “one mighty GPU” with ultra-fast GPU-to-GPU communication and on advanced software able to take full…
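To illustrate what “one mighty GPU” means in practice, the numpy sketch below simulates tensor parallelism on CPU: a linear layer's weight is split column-wise across devices, each computes its shard independently, and the shards are concatenated, the step that real deployments perform with fast GPU-to-GPU collectives.

```python
import numpy as np

# Illustrative tensor parallelism on CPU: split a linear layer's weight
# matrix column-wise across "devices", compute shards independently, then
# concatenate -- the step real systems do with GPU-to-GPU collectives.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1024))     # activations: [batch, hidden]
W = rng.standard_normal((1024, 4096))  # weight: [hidden, ffn]

world_size = 4
shards = np.split(W, world_size, axis=1)  # each device holds one column shard
partial = [x @ w for w in shards]         # independent local matmuls
y_tp = np.concatenate(partial, axis=1)    # all-gather over NVLink in practice

assert np.allclose(y_tp, x @ W)           # matches the single-device result
```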
The Llama 3.1 405B large language model (LLM), developed by Meta, is an open-source community model that delivers state-of-the-art performance and supports a variety of use cases. With 405 billion parameters and support for context lengths of up to 128K tokens, Llama 3.1 405B is also one of the most demanding LLMs to run. To deliver both low latency to optimize the user experience and high…
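A quick back-of-envelope calculation (weights only, ignoring KV cache and activations) shows why 405B parameters demand multi-GPU serving:

```python
# Weight memory alone for a 405B-parameter model, before KV cache.
params = 405e9
bytes_per_param = {"FP16": 2, "FP8": 1}

for dtype, nbytes in bytes_per_param.items():
    gb = params * nbytes / 1e9
    gpus = gb / 80  # e.g., an 80-GB H100
    print(f"{dtype}: {gb:,.0f} GB of weights, "
          f"roughly {gpus:.0f}+ 80-GB GPUs just to hold them")
```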
Large language models (LLMs) are getting larger, increasing the amount of compute required to process inference requests. To meet real-time latency requirements for serving today’s LLMs and do so for as many users as possible, multi-GPU compute is a must. Low latency improves the user experience. High throughput reduces the cost of service. Both are simultaneously important. Even if a large…
As large language models (LLMs) continue to grow in size and complexity, the performance requirements for serving them quickly and cost-effectively continue to grow. Delivering high LLM inference performance requires an efficient parallel computing architecture and a flexible and highly optimized software stack. Recently, NVIDIA Hopper GPUs running NVIDIA TensorRT-LLM inference software set…
NVIDIA today announced the latest release of NVIDIA TensorRT, an ecosystem of APIs for high-performance deep learning inference. TensorRT includes inference runtimes and model optimizations that deliver low latency and high throughput for production applications. This post outlines the key features and upgrades of this release, including easier installation, increased usability…
Large language models (LLMs) have seen dramatic growth over the last year, and the challenge of delivering great user experiences depends on both high compute throughput and large amounts of high-bandwidth memory. NVIDIA TensorRT-LLM provides optimizations for both peak throughput and memory usage, delivering massive improvements in LLM inference performance.
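Memory pressure is easy to quantify for the KV cache: each layer stores one key and one value vector per KV head per token. Using Llama-3-70B-like shapes (80 layers, 8 KV heads with GQA, head dimension 128) as an example:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Per-sequence KV cache size: K and V for every layer and position."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Llama-3-70B-like shapes: 80 layers, 8 KV heads via GQA, head_dim 128.
gb = kv_cache_bytes(80, 8, 128, seq_len=8192) / 1e9
print(f"~{gb:.1f} GB of KV cache per 8K-token sequence at FP16")  # ~2.7 GB
```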
Today, NVIDIA announces the public release of TensorRT-LLM to accelerate and optimize inference performance for the latest LLMs on NVIDIA GPUs. This open-source library is now available for free on the /NVIDIA/TensorRT-LLM GitHub repo and as part of the NVIDIA NeMo framework. Large language models (LLMs) have revolutionized the field of artificial intelligence and created entirely new ways of…
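A minimal quickstart sketch based on TensorRT-LLM's high-level LLM API follows; exact argument names can vary between releases, and the model checkpoint is just an example, so check the GitHub repo for current signatures.

```python
# A minimal sketch of TensorRT-LLM's high-level LLM API; argument names
# and availability may vary by release -- check the GitHub repo.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # example HF checkpoint
params = SamplingParams(temperature=0.8, max_tokens=64)

for out in llm.generate(["What is TensorRT-LLM?"], params):
    print(out.outputs[0].text)
```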
Large language models (LLMs) offer incredible new capabilities, expanding the frontier of what is possible with AI. However, their large size and unique execution characteristics can make them difficult to use in cost-effective ways. NVIDIA has been working closely with leading companies, including Meta, Anyscale, Cohere, Deci, Grammarly, Mistral AI, MosaicML (now a part of Databricks)…
Imagine that you have trained your model with PyTorch, TensorFlow, or the framework of…
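Once a model from any framework is deployed behind Triton, clients query it the same way. Below is a short `tritonclient` HTTP sketch in which the model and tensor names (`my_model`, `INPUT0`, `OUTPUT0`) are hypothetical placeholders for your deployment.

```python
# Querying a model served by Triton over HTTP with the tritonclient package.
# "my_model", "INPUT0", and "OUTPUT0" are hypothetical names.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.random.rand(1, 16).astype(np.float32)
inp = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
inp.set_data_from_numpy(data)

result = client.infer(model_name="my_model", inputs=[inp])
print(result.as_numpy("OUTPUT0"))
```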