Anjali Shah – NVIDIA Technical Blog

News and tutorials for developers, data scientists, and IT admins 2025-04-23T15:02:57Z http://www.open-lab.net/blog/feed/

Anjali Shah <![CDATA[NVIDIA Blackwell Delivers World-Record DeepSeek-R1 Inference Performance]]> http://www.open-lab.net/blog/?p=97352 2025-04-23T00:23:25Z 2025-03-18T17:41:42Z

NVIDIA announced world-record DeepSeek-R1 inference performance at NVIDIA GTC 2025. A single NVIDIA DGX system with eight NVIDIA Blackwell GPUs can achieve over 250 tokens per second per user or a maximum throughput of over 30,000 tokens per second on the massive, state-of-the-art 671 billion parameter DeepSeek-R1 model. These rapid advancements in performance at both ends of the performance…

]]> 1 Anjali Shah <![CDATA[Optimizing Qwen2.5-Coder Throughput with NVIDIA TensorRT-LLM Lookahead Decoding]]> http://www.open-lab.net/blog/?p=96010 2025-04-23T02:44:36Z 2025-02-14T18:19:37Z

Large language models (LLMs) that specialize in coding have been steadily adopted into developer workflows. From pair programming to self-improving AI agents, these models assist developers with various tasks, including enhancing code, fixing bugs, generating tests, and writing documentation. To promote the development of open-source LLMs, the Qwen team recently released Qwen2.5-Coder…

]]> 1 Anjali Shah <![CDATA[Introducing New KV Cache Reuse Optimizations in NVIDIA TensorRT-LLM]]> http://www.open-lab.net/blog/?p=95040 2025-04-23T15:02:57Z 2025-01-16T22:57:30Z

Language models generate text by predicting the next token, given all the previous tokens including the input text tokens. Key and value elements of the previous tokens are used as historical context in LLM serving for generation of the next set of tokens. Caching these key and value elements from previous tokens avoids expensive recomputation and effectively leads to higher throughput. However…

]]> Anjali Shah <![CDATA[Boost Llama 3.3 70B Inference Throughput 3x with NVIDIA TensorRT-LLM Speculative Decoding]]> http://www.open-lab.net/blog/?p=94146 2024-12-19T23:03:40Z 2024-12-17T17:00:00Z

Meta’s Llama collection of open large language models (LLMs) continues to grow with the recent addition of Llama 3.3 70B, a text-only instruction-tuned model. Llama 3.3 provides enhanced performance relative to the older Llama 3.1 70B model and can even match the capabilities of the larger, more computationally expensive Llama 3.1 405B model on several tasks including math, reasoning, coding…

]]> 2 Anjali Shah <![CDATA[NVIDIA TensorRT-LLM Now Accelerates Encoder-Decoder Models with In-Flight Batching]]> http://www.open-lab.net/blog/?p=93516 2024-12-12T19:35:15Z 2024-12-11T22:10:51Z

NVIDIA recently announced that NVIDIA TensorRT-LLM now accelerates encoder-decoder model architectures. TensorRT-LLM is an open-source library that optimizes inference for diverse model architectures, including the following: The addition of encoder-decoder model support further expands TensorRT-LLM capabilities, providing highly optimized inference for an even broader range of…

]]> Anjali Shah <![CDATA[TensorRT-LLM Speculative Decoding Boosts Inference Throughput by up to 3.6x]]> http://www.open-lab.net/blog/?p=92847 2025-01-11T17:32:51Z 2024-12-02T23:09:43Z

NVIDIA TensorRT-LLM support for speculative decoding now provides a speedup of over 3x in total token throughput. TensorRT-LLM is an open-source library that provides blazing-fast inference support for numerous popular large language models (LLMs) on NVIDIA GPUs. By adding support for speculative decoding on single GPU and single-node multi-GPU, the library further expands its supported…

]]> 3 Anjali Shah <![CDATA[Llama 3.2 Full-Stack Optimizations Unlock High Performance on NVIDIA GPUs]]> http://www.open-lab.net/blog/?p=90142 2024-11-22T23:11:53Z 2024-11-19T16:00:00Z

Meta recently released its Llama 3.2 series of vision language models (VLMs), which come in 11B parameter and 90B parameter variants. These models are multimodal, supporting both text and image inputs. In addition, Meta has launched text-only small language model (SLM) variants of Llama 3.2 with 1B and 3B parameters. NVIDIA has optimized the Llama 3.2 collection of models for great performance and…

]]> Anjali Shah <![CDATA[Deploying Accelerated Llama 3.2 from the Edge to the Cloud]]> http://www.open-lab.net/blog/?p=89436 2024-11-07T05:08:12Z 2024-09-25T18:39:49Z

Expanding the open-source Meta Llama collection of models, the Llama 3.2 collection includes vision language models (VLMs), small language models (SLMs), and an updated Llama Guard model with support for vision. When paired with the NVIDIA accelerated computing platform, Llama 3.2 offers developers, researchers, and enterprises valuable new capabilities and optimizations to realize their…

]]> Anjali Shah <![CDATA[Boosting Llama 3.1 405B Performance up to 1.44x with NVIDIA TensorRT Model Optimizer on NVIDIA H200 GPUs]]> http://www.open-lab.net/blog/?p=88017 2024-11-14T15:58:41Z 2024-08-28T19:30:00Z

The Llama 3.1 405B large language model (LLM), developed by Meta, is an open-source community model that delivers state-of-the-art performance and supports a variety of use cases. With 405 billion parameters and support for context lengths of up to 128K tokens, Llama 3.1 405B is also one of the most demanding LLMs to run. To deliver both low latency to optimize the user experience and high…

]]> 1 Anjali Shah <![CDATA[Jamba 1.5 LLMs Leverage Hybrid Architecture to Deliver Superior Reasoning and Long Context Handling]]> http://www.open-lab.net/blog/?p=87847 2024-09-05T17:57:25Z 2024-08-22T16:03:46Z

AI21 Labs has unveiled their latest and most advanced Jamba 1.5 model family, a cutting-edge collection of large language models (LLMs) designed to excel in a wide array of generative AI tasks. These models are capable of creating content, summarizing and comparing documents, and extracting valuable insights from vast datasets. This mixture of experts (MoE) model takes advantage of the…

]]> Anjali Shah <![CDATA[Power Text-Generation Applications with Mistral NeMo 12B Running on a Single GPU]]> http://www.open-lab.net/blog/?p=86123 2024-08-28T15:32:33Z 2024-07-26T21:03:15Z

NVIDIA collaborated with Mistral to co-build the next-generation language model that achieves leading performance across benchmarks in its class. With a growing number of language models purpose-built for select tasks, NVIDIA Research and Mistral AI combined forces to offer a versatile, open language model that’s performant and runs on a single GPU, such as NVIDIA A100 or H100 GPUs.

]]> 3 Anjali Shah <![CDATA[Revolutionizing Code Completion with Codestral Mamba, the Next-Gen Coding LLM]]> http://www.open-lab.net/blog/?p=85101 2024-08-08T18:48:30Z 2024-07-25T19:57:14Z

In the rapidly evolving field of generative AI, coding models have become indispensable tools for developers, enhancing productivity and precision in software development. They provide significant benefits by automating complex tasks, enhancing scalability, and fostering innovation, making them invaluable tools in modern software development. This post explores the benefits of Codestral Mamba…

]]> Anjali Shah <![CDATA[Supercharging Llama 3.1 across NVIDIA Platforms]]> http://www.open-lab.net/blog/?p=85678 2025-02-17T05:23:06Z 2024-07-23T15:15:00Z

Meta’s Llama collection of large language models is the most popular family of foundation models in the open-source community today, supporting a variety of use cases. Millions of developers worldwide are building derivative models and integrating them into their applications. With Llama 3.1, Meta is launching a suite of large language models (LLMs) as well as a suite of trust and safety models…

]]> 13 Anjali Shah <![CDATA[Turbocharging Meta Llama 3 Performance with NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server]]> http://www.open-lab.net/blog/?p=81223 2024-11-14T15:54:32Z 2024-04-28T18:07:15Z

We’re excited to announce support for the Meta Llama 3 family of models in NVIDIA TensorRT-LLM, accelerating and optimizing your LLM inference performance. You can immediately try Llama 3 8B and Llama 3 70B, the first models in the series, through a browser user interface or through API endpoints running on a fully accelerated NVIDIA stack from the NVIDIA API catalog, where Llama 3 is packaged as…

]]> 61 Anjali Shah <![CDATA[Generate Stunning Images with Stable Diffusion XL on the NVIDIA AI Inference Platform]]> http://www.open-lab.net/blog/?p=78388 2025-03-18T18:31:44Z 2024-03-07T19:05:46Z

As of March 18, 2025, NVIDIA Triton Inference Server is part of the NVIDIA Dynamo Platform and has been renamed NVIDIA Dynamo Triton. Diffusion models are transforming creative workflows across industries. These models generate stunning images based on simple text or image inputs by iteratively shaping random noise into AI-generated art through denoising diffusion…

]]> 1 Anjali Shah <![CDATA[NVIDIA TensorRT-LLM Revs Up Inference for Google Gemma]]> http://www.open-lab.net/blog/?p=78037 2024-11-14T15:52:24Z 2024-02-21T13:00:00Z

NVIDIA is collaborating as a launch partner with Google in delivering Gemma, a newly optimized family of open models built from the same research and technology used to create the Gemini models. An optimized release with TensorRT-LLM enables users to develop with LLMs using only a desktop with an NVIDIA RTX GPU. Created by Google DeepMind, Gemma 2B and Gemma 7B—the first models in the series…

]]> 0 Anjali Shah <![CDATA[Mastering LLM Techniques: Training]]> http://www.open-lab.net/blog/?p=73464 2024-01-22T22:05:25Z 2023-11-16T14:00:00Z

Large language models (LLMs) are a class of generative AI models built using transformer networks that can recognize, summarize, translate, predict, and generate language using very large datasets. LLMs have the promise of transforming society as we know it, yet training these foundation models is incredibly challenging. This blog articulates the basic principles behind LLMs…

]]> 0 Anjali Shah <![CDATA[Mastering LLM Techniques: Customization]]> http://www.open-lab.net/blog/?p=68897 2023-12-08T18:54:22Z 2023-08-10T16:30:00Z

Large language models (LLMs) are becoming an integral tool for businesses to improve their operations, customer interactions, and decision-making processes. However, off-the-shelf LLMs often fall short in meeting the specific needs of enterprises due to industry-specific terminology, domain expertise, or unique requirements. This is where custom LLMs come into play.

]]> 0