Best-in-class LLM inference requires two key elements: speed and developer velocity. Speed refers to maximizing the efficiency of the underlying hardware by using highly optimized compute kernels and algorithms. Developer velocity refers to the ability to quickly adopt these new kernels and accelerate new models, algorithms, and hardware. Ultimately, this velocity is underpinned by the quick…
As of March 18, 2025, NVIDIA Triton Inference Server is now part of the NVIDIA Dynamo Platform and has accordingly been renamed NVIDIA Dynamo Triton. The explosion of AI-driven applications has placed unprecedented demands both on AI infrastructure and on developers, who must balance delivering cutting-edge performance with managing operational complexity and cost.
Meta recently released its Llama 3.2 series of vision language models (VLMs), which come in 11B parameter and 90B parameter variants. These models are multimodal, supporting both text and image inputs. In addition, Meta has launched text-only small language model (SLM) variants of Llama 3.2 with 1B and 3B parameters. NVIDIA has optimized the Llama 3.2 collection of models for great performance and…
In this blog post, we take a closer look at chunked prefill, a feature of NVIDIA TensorRT-LLM that increases GPU utilization and simplifies the deployment experience for developers. This builds on our previous post discussing how advanced KV cache optimization features in TensorRT-LLM improve performance by up to 5x in use cases that require system prompts. When a user submits a request to…
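To make the scheduling idea concrete, below is a minimal Python sketch of chunked prefill, not TensorRT-LLM's actual implementation: the chunk size, `Request` structure, and scheduler are all hypothetical, but they show how splitting a long prompt into chunks lets prefill work interleave with ongoing decode steps.

```python
from collections import deque
from dataclasses import dataclass, field

CHUNK_TOKENS = 512  # hypothetical chunk size; real systems tune this

@dataclass
class Request:
    prompt_tokens: list
    chunks: deque = field(default_factory=deque)

def chunk_prompt(req: Request, chunk_size: int = CHUNK_TOKENS) -> None:
    """Split the prompt into fixed-size chunks for incremental prefill."""
    toks = req.prompt_tokens
    req.chunks = deque(toks[i:i + chunk_size]
                       for i in range(0, len(toks), chunk_size))

def schedule_step(prefill_queue: deque, decode_batch: list) -> list:
    """One illustrative scheduler iteration: run at most one prefill chunk
    alongside the whole decode batch, so a long prompt no longer
    monopolizes the GPU for an entire prefill pass."""
    work = [("decode", r) for r in decode_batch]
    if prefill_queue:
        req = prefill_queue[0]
        work.append(("prefill_chunk", req.chunks.popleft()))
        if not req.chunks:            # prompt fully processed into KV cache
            prefill_queue.popleft()   # request graduates to the decode phase
            decode_batch.append(req)
    return work
```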
In our previous blog post, we demonstrated how reusing the key-value (KV) cache by offloading it to CPU memory can accelerate time to first token (TTFT) by up to 14x on x86-based NVIDIA H100 Tensor Core GPUs and 28x on the NVIDIA GH200 Superchip. In this post, we shed light on KV cache reuse techniques and best practices that can drive even further TTFT speedups. LLMs are rapidly…
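As an illustration of the reuse mechanism (a toy model, not TensorRT-LLM's API), the sketch below hashes prompt prefixes at block granularity and promotes offloaded blocks from host memory back to the GPU on a hit; the names and block size are assumptions.

```python
import hashlib

BLOCK_TOKENS = 64  # hypothetical KV block granularity

gpu_cache = {}  # block_hash -> KV block resident on GPU
cpu_cache = {}  # block_hash -> KV block offloaded to host memory

def block_hashes(token_ids):
    """Hash each full prompt-prefix block; matching hashes mean the
    corresponding KV blocks can be reused instead of recomputed."""
    hashes, h = [], hashlib.sha256()
    full = len(token_ids) - len(token_ids) % BLOCK_TOKENS
    for i in range(0, full, BLOCK_TOKENS):
        h.update(str(token_ids[i:i + BLOCK_TOKENS]).encode("utf-8"))
        hashes.append(h.hexdigest())  # rolling hash ties a block to its prefix
    return hashes

def lookup(token_ids):
    """Count how many leading KV blocks can be served from cache,
    promoting CPU-offloaded blocks back to the GPU on a hit."""
    reused = 0
    for hh in block_hashes(token_ids):
        if hh in gpu_cache:
            reused += 1
        elif hh in cpu_cache:
            gpu_cache[hh] = cpu_cache[hh]  # onboard from host to GPU
            reused += 1
        else:
            break  # first miss: everything after it must be prefilled
    return reused * BLOCK_TOKENS  # tokens whose prefill is skipped
```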
Deploying generative AI workloads in production environments, where user numbers can fluctuate from hundreds to hundreds of thousands and where input sequence lengths differ with each request, poses unique challenges. To achieve low latency inference in these environments, multi-GPU setups are a must, irrespective of the GPU generation or its memory capacity. To enhance inference performance in…
Deploying large language models (LLMs) in production environments often requires making hard trade-offs between enhancing user interactivity and increasing system throughput. While enhancing user interactivity requires minimizing time to first token (TTFT), increasing throughput requires increasing tokens per second. Improving one aspect often results in the decline of the other…
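A toy cost model makes the tension visible; the constants below are assumed purely for illustration. Larger batches amortize the fixed per-step cost and raise aggregate tokens per second, but every sequence in the batch then waits longer for each of its own tokens.

```python
# Toy decode-step cost model with assumed constants; real profiles differ.
STEP_OVERHEAD_MS = 5.0  # fixed per-step cost (weight reads, kernel launch)
PER_SEQ_MS = 0.5        # incremental per-sequence cost within a step

for batch in (1, 8, 32, 128):
    step_ms = STEP_OVERHEAD_MS + PER_SEQ_MS * batch
    tokens_per_sec = batch * 1000.0 / step_ms  # system throughput rises...
    ms_per_user_token = step_ms                # ...but each user waits longer
    print(f"batch={batch:>4}  {tokens_per_sec:8.1f} tok/s  "
          f"{ms_per_user_token:5.1f} ms/token per user")
```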
The continued growth of LLM capabilities, fueled by increasing parameter counts and support for longer contexts, has led to their usage in a wide variety of applications, each with diverse deployment requirements. For example, a chatbot serves a small number of users at very low latency to ensure good interactivity. Meanwhile, synthetic data generation requires high throughput to process many items…
Many of the most exciting applications of large language models (LLMs), such as interactive speech bots, coding co-pilots, and search, need to begin responding to user queries quickly to deliver positive user experiences. The time that it takes for an LLM to ingest a user prompt (and context, which can be sizable) and begin outputting a response is called time to first token (TTFT).
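Measuring TTFT is straightforward once the server streams tokens. Below is a small, generic sketch in which `llm_client.generate_stream` is a hypothetical streaming API standing in for whatever client you use.

```python
import time

def measure_ttft(stream):
    """Time from request dispatch to the first generated token.
    `stream` is any iterator that yields tokens as they are produced."""
    start = time.perf_counter()
    first_token = next(stream)  # blocks through the entire prefill phase
    ttft = time.perf_counter() - start
    return first_token, ttft

# Usage with a hypothetical streaming client (assumed API, not a real SDK):
# tokens = llm_client.generate_stream(prompt)
# tok, ttft = measure_ttft(tokens)
# print(f"TTFT: {ttft * 1000:.0f} ms")
```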
As large language models (LLMs) continue to grow in size and complexity, multi-GPU compute is a must-have to deliver the low latency and high throughput that real-time generative AI applications demand. Performance depends on both the ability of the combined GPUs to process requests as “one mighty GPU” with ultra-fast GPU-to-GPU communication and on advanced software able to take full…
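To illustrate what “one mighty GPU” means in practice, the numpy sketch below simulates tensor parallelism on CPU: a linear layer's weight is split column-wise across devices, each computes its shard independently, and the shards are concatenated, the step that real deployments perform with fast GPU-to-GPU collectives.

```python
import numpy as np

# Illustrative tensor parallelism on CPU: split a linear layer's weight
# matrix column-wise across "devices", compute shards independently, then
# concatenate -- the step real systems do with GPU-to-GPU collectives.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1024))     # activations: [batch, hidden]
W = rng.standard_normal((1024, 4096))  # weight: [hidden, ffn]

world_size = 4
shards = np.split(W, world_size, axis=1)  # each device holds one column shard
partial = [x @ w for w in shards]         # independent local matmuls
y_tp = np.concatenate(partial, axis=1)    # all-gather over NVLink in practice

assert np.allclose(y_tp, x @ W)           # matches the single-device result
```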
The Llama 3.1 405B large language model (LLM), developed by Meta, is an open-source community model that delivers state-of-the-art performance and supports a variety of use cases. With 405 billion parameters and support for context lengths of up to 128K tokens, Llama 3.1 405B is also one of the most demanding LLMs to run. To deliver both low latency to optimize the user experience and high…
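A quick back-of-envelope calculation (weights only, ignoring KV cache and activations) shows why 405B parameters demand multi-GPU serving:

```python
# Weight memory alone for a 405B-parameter model, before KV cache.
params = 405e9
bytes_per_param = {"FP16": 2, "FP8": 1}

for dtype, nbytes in bytes_per_param.items():
    gb = params * nbytes / 1e9
    gpus = gb / 80  # e.g., an 80-GB H100
    print(f"{dtype}: {gb:,.0f} GB of weights, "
          f"roughly {gpus:.0f}+ 80-GB GPUs just to hold them")
```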
Large language models (LLMs) are getting larger, increasing the amount of compute required to process inference requests. To meet real-time latency requirements for serving today’s LLMs and do so for as many users as possible, multi-GPU compute is a must. Low latency improves the user experience. High throughput reduces the cost of service. Both are simultaneously important. Even if a large…
As large language models (LLMs) continue to grow in size and complexity, the performance requirements for serving them quickly and cost-effectively continue to grow. Delivering high LLM inference performance requires an efficient parallel computing architecture and a flexible and highly optimized software stack. Recently, NVIDIA Hopper GPUs running NVIDIA TensorRT-LLM inference software set…
NVIDIA today announced the latest release of NVIDIA TensorRT, an ecosystem of APIs for high-performance deep learning inference. TensorRT includes inference runtimes and model optimizations that deliver low latency and high throughput for production applications. This post outlines the key features and upgrades of this release, including easier installation, increased usability…
Large language models (LLMs) have seen dramatic growth over the last year, and the challenge of delivering great user experiences depends on both high compute throughput and large amounts of high-bandwidth memory. NVIDIA TensorRT-LLM provides optimizations for both peak throughput and memory usage, delivering massive improvements in LLM inference performance.
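Memory pressure is easy to quantify for the KV cache: each layer stores one key and one value vector per KV head per token. Using Llama-3-70B-like shapes (80 layers, 8 KV heads with GQA, head dimension 128) as an example:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Per-sequence KV cache size: K and V for every layer and position."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Llama-3-70B-like shapes: 80 layers, 8 KV heads via GQA, head_dim 128.
gb = kv_cache_bytes(80, 8, 128, seq_len=8192) / 1e9
print(f"~{gb:.1f} GB of KV cache per 8K-token sequence at FP16")  # ~2.7 GB
```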
Today, NVIDIA announces the public release of TensorRT-LLM to accelerate and optimize inference performance for the latest LLMs on NVIDIA GPUs. This open-source library is now available for free on the /NVIDIA/TensorRT-LLM GitHub repo and as part of the NVIDIA NeMo framework. Large language models (LLMs) have revolutionized the field of artificial intelligence and created entirely new ways of…
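A minimal quickstart sketch based on TensorRT-LLM's high-level LLM API follows; exact argument names can vary between releases, and the model checkpoint is just an example, so check the GitHub repo for current signatures.

```python
# A minimal sketch of TensorRT-LLM's high-level LLM API; argument names
# and availability may vary by release -- check the GitHub repo.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # example HF checkpoint
params = SamplingParams(temperature=0.8, max_tokens=64)

for out in llm.generate(["What is TensorRT-LLM?"], params):
    print(out.outputs[0].text)
```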
Large language models (LLMs) offer incredible new capabilities, expanding the frontier of what is possible with AI. However, their large size and unique execution characteristics can make them difficult to use in cost-effective ways. NVIDIA has been working closely with leading companies, including Meta, Anyscale, Cohere, Deci, Grammarly, Mistral AI, MosaicML (now a part of Databricks)…
Imagine that you have trained your model with PyTorch, TensorFlow, or the framework of…
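Once a model from any framework is deployed behind Triton, clients query it the same way. Below is a short `tritonclient` HTTP sketch in which the model and tensor names (`my_model`, `INPUT0`, `OUTPUT0`) are hypothetical placeholders for your deployment.

```python
# Querying a model served by Triton over HTTP with the tritonclient package.
# "my_model", "INPUT0", and "OUTPUT0" are hypothetical names.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.random.rand(1, 16).astype(np.float32)
inp = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
inp.set_data_from_numpy(data)

result = client.infer(model_name="my_model", inputs=[inp])
print(result.as_numpy("OUTPUT0"))
```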