NVIDIA has achieved a world-record large language model (LLM) inference speed. A single NVIDIA DGX B200 node with eight NVIDIA Blackwell GPUs can achieve over 1,000 tokens per second (TPS) per user on the 400-billion-parameter Llama 4 Maverick model, the largest and most powerful model available in the Llama 4 collection. This speed was independently measured by the AI benchmarking service…
Curating high-quality pretraining datasets is critical for enterprise developers aiming to train state-of-the-art large language models (LLMs). To enable developers to build highly accurate LLMs, NVIDIA previously released Nemotron-CC, a 6.3-trillion-token English-language Common Crawl (CC) dataset. Today, the NVIDIA NeMo Curator team is excited to share that the pipeline used to build the…
With emerging use cases such as digital humans, agents, podcasts, images, and video generation, generative AI is changing the way we interact with PCs. This paradigm shift calls for new ways of interfacing with and programming generative AI models. However, getting started can be daunting for PC developers and AI enthusiasts. Today, NVIDIA released a suite of NVIDIA NIM microservices on…
2024 was another landmark year for developers, researchers, and innovators working with NVIDIA technologies. From groundbreaking developments in AI inference to empowering open-source contributions, these blog posts highlight the breakthroughs that resonated most with our readers.

NVIDIA NIM Offers Optimized Inference Microservices for Deploying AI Models at Scale
Introduced in…
Meta recently released its Llama 3.2 series of vision language models (VLMs), which come in 11B-parameter and 90B-parameter variants. These models are multimodal, supporting both text and image inputs. In addition, Meta has launched text-only small language model (SLM) variants of Llama 3.2 with 1B and 3B parameters. NVIDIA has optimized the Llama 3.2 collection of models for great performance and…
As models grow larger and are trained on more data, they become more capable, making them more useful. To train these models quickly, more performance, delivered at data center scale, is required. The NVIDIA Blackwell platform, launched at GTC 2024 and now in full production, integrates six types of chips: GPU, CPU, DPU, NVLink Switch chip, InfiniBand Switch, and Ethernet Switch.
In our previous blog post, we demonstrated how reusing the key-value (KV) cache by offloading it to CPU memory can accelerate time to first token (TTFT) by up to 14x on x86-based NVIDIA H100 Tensor Core GPUs and 28x on the NVIDIA GH200 Superchip. In this post, we shed light on KV cache reuse techniques and best practices that can drive even further TTFT speedups. LLMs are rapidly…
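The core idea behind KV cache reuse is simple: when a new request shares a token prefix with an earlier one, the keys and values already computed for that prefix can be restored from a cache instead of recomputed during prefill. Below is a minimal conceptual sketch of prefix-keyed reuse with a CPU-side store; the hashing scheme and the `compute_kv` callback are illustrative assumptions, not the TensorRT-LLM implementation.

```python
import hashlib

# Illustrative offloaded store: prefix hash -> KV tensors kept in CPU memory.
cpu_kv_store: dict[str, object] = {}

def prefix_hash(token_ids: list[int]) -> str:
    """Key the cache by the exact token prefix."""
    return hashlib.sha256(str(token_ids).encode()).hexdigest()

def get_or_compute_kv(token_ids: list[int], compute_kv):
    """Reuse cached KV for the longest matching prefix; prefill only the rest.

    compute_kv(tokens) stands in for the model's prefill step.
    """
    # Walk from the longest prefix down, looking for a cache hit.
    for cut in range(len(token_ids), 0, -1):
        key = prefix_hash(token_ids[:cut])
        if key in cpu_kv_store:
            cached = cpu_kv_store[key]          # copied CPU -> GPU in practice
            suffix = token_ids[cut:]
            fresh = compute_kv(suffix) if suffix else None
            return cached, fresh
    # Miss: prefill everything and offload the result for future requests.
    kv = compute_kv(token_ids)
    cpu_kv_store[prefix_hash(token_ids)] = kv
    return kv, None
```

A system prompt shared across many users is the classic win here: its KV blocks are computed once, and every later request skips straight to prefilling only the user-specific suffix.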
The continued growth of LLM capabilities, fueled by increasing parameter counts and support for longer contexts, has led to their use in a wide variety of applications, each with diverse deployment requirements. For example, a chatbot supports a small number of users at very low latencies for good interactivity. Meanwhile, synthetic data generation requires high throughput to process many items…
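That tension between interactivity and throughput can be made concrete with back-of-the-envelope arithmetic: enlarging the batch raises aggregate tokens per second while dividing per-user speed. A toy cost model follows; all numbers are made up for illustration, not measurements.

```python
# Toy decode-step cost model: a fixed overhead plus a per-sequence cost.
STEP_OVERHEAD_MS = 5.0  # assumed fixed cost of one decode step
PER_SEQ_MS = 0.5        # assumed incremental cost per batched sequence

for batch in (1, 8, 64):
    step_ms = STEP_OVERHEAD_MS + PER_SEQ_MS * batch
    per_user_tps = 1000.0 / step_ms        # each user receives one token per step
    aggregate_tps = per_user_tps * batch   # server-wide tokens per second
    print(f"batch={batch:3d}  per-user={per_user_tps:6.1f} TPS  "
          f"aggregate={aggregate_tps:7.1f} TPS")
```

Under these toy numbers, batch 1 gives each user about 182 TPS but the server only 182 TPS overall, while batch 64 drops each user to about 27 TPS yet lifts aggregate throughput near 1,730 TPS: exactly why a chatbot and a synthetic data pipeline want different deployments of the same model.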
The Llama 3.1 Nemotron 70B Reward model helps generate high-quality training data that aligns with human preferences for finance, retail, healthcare, scientific research, telecommunications, and sovereign AI.
Many of the most exciting applications of large language models (LLMs), such as interactive speech bots, coding co-pilots, and search, need to begin responding to user queries quickly to deliver positive user experiences. The time that it takes for an LLM to ingest a user prompt (and context, which can be sizable) and begin outputting a response is called time to first token (TTFT).
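Because TTFT is just the wall-clock gap between sending a prompt and receiving the first streamed token, it can be measured against any OpenAI-compatible streaming endpoint. A rough sketch using the openai Python client; the base URL and model name are placeholders for whatever server you are testing.

```python
import time
from openai import OpenAI

# Placeholder endpoint and model: point these at the server under test.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="my-llm",
    messages=[{"role": "user", "content": "Summarize this 20-page report."}],
    stream=True,
)
for chunk in stream:
    # The first chunk that carries content marks the time to first token.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {(time.perf_counter() - start) * 1000:.1f} ms")
        break
```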
Today, NVIDIA released a unique language model that delivers unmatched accuracy and efficiency. Llama 3.1-Nemotron-51B, derived from Meta's Llama-3.1-70B, uses a novel neural architecture search (NAS) approach that results in a highly accurate and efficient model. The model fits on a single NVIDIA H100 GPU at high workloads, making it much more accessible and affordable.
As large language models (LLMs) continue to grow in size and complexity, multi-GPU compute is a must-have to deliver the low latency and high throughput that real-time generative AI applications demand. Performance depends both on the ability of the combined GPUs to process requests as "one mighty GPU" with ultra-fast GPU-to-GPU communication and on advanced software able to take full…
The Llama 3.1 405B large language model (LLM), developed by Meta, is an open-source community model that delivers state-of-the-art performance and supports a variety of use cases. With 405 billion parameters and support for context lengths of up to 128K tokens, Llama 3.1 405B is also one of the most demanding LLMs to run. To deliver both low latency to optimize the user experience and high…
The large language models in Meta's Llama collection are the most popular foundation models in the open-source community today, supporting a variety of use cases. Millions of developers worldwide are building derivative models and integrating them into their applications. With Llama 3.1, Meta is launching a suite of large language models (LLMs) as well as a suite of trust and safety models…
Employing retrieval-augmented generation (RAG) is an effective strategy for ensuring large language model (LLM) responses are up-to-date and not hallucinated. While various retrieval strategies can improve the recall of documents for generation, there is no one-size-fits-all approach. The retrieval pipeline depends on your data, from hyperparameters like the chunk size…
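Chunk size is a good example of why no single setting wins: smaller chunks give sharper matches but strip away surrounding context, while larger chunks keep context but dilute relevance. A minimal fixed-size chunker with overlap, as a hedged illustration; the sizes are arbitrary defaults, not recommendations.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping fixed-size character chunks.

    The overlap keeps sentences that straddle a boundary retrievable
    from both neighboring chunks.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

sample = "RAG grounds LLM answers in your own documents. " * 50
print([len(c) for c in chunk_text(sample)])
```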
Synthetic data isn't about creating new information. It's about transforming existing information to create different variants. For over a decade, synthetic data has been used to improve model accuracy across the board, whether it is transforming images to improve object detection models, strengthening fraudulent credit card detection, or improving BERT models for QA. What's new?
We're excited to announce support for the Meta Llama 3 family of models in NVIDIA TensorRT-LLM, accelerating and optimizing your LLM inference performance. You can immediately try Llama 3 8B and Llama 3 70B, the first models in the series, through a browser user interface. Or, through API endpoints running on a fully accelerated NVIDIA stack from the NVIDIA API catalog, where Llama 3 is packaged as…
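For the API-endpoint route, the catalog exposes an OpenAI-compatible interface, so an existing client can call the hosted Llama 3 models with only a base URL and key change. A sketch under that assumption; the endpoint URL and model identifier below follow the catalog's published pattern and may change, so verify them on the model's catalog page.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed catalog endpoint
    api_key="nvapi-...",  # key generated from the NVIDIA API catalog
)

completion = client.chat.completions.create(
    model="meta/llama3-70b-instruct",  # assumed catalog model ID
    messages=[{"role": "user", "content": "Write a limerick about GPUs."}],
    max_tokens=64,
)
print(completion.choices[0].message.content)
```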