TensorRT-LLM – NVIDIA Technical Blog
News and tutorials for developers, data scientists, and IT admins
Feed: http://www.open-lab.net/blog/feed/ (last updated 2025-07-03)

Shara Tibken – Check Out Sovereign AI in Practice Through an NVIDIA Webinar (2025-06-25)
http://www.open-lab.net/blog/?p=102680
Join NVIDIA experts and leading European model builders on July 8 for a webinar on building and deploying multilingual large language models.

Igor Gitman – How to Streamline Complex LLM Workflows Using NVIDIA NeMo-Skills (2025-06-25)
http://www.open-lab.net/blog/?p=102597
A typical recipe for improving LLMs involves multiple stages: synthetic data generation (SDG), model training through supervised fine-tuning (SFT) or reinforcement learning (RL), and model evaluation. Each stage requires using different libraries, which are often challenging to set up and difficult to use together. For example, you might use NVIDIA TensorRT-LLM or vLLM for SDG and NVIDIA…

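The SDG stage described above is typically driven by an inference library. As a hedged illustration (not code from the post), here is a minimal synthetic data generation sketch using vLLM's Python API; the model name and seed prompts are placeholders:

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint; any Hugging Face-compatible model works.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

seed_prompts = [
    "Solve step by step: if 3x + 5 = 20, what is x?",
    "Write a Python function that reverses a string.",
]
for out in llm.generate(seed_prompts, params):
    # Each completion becomes a candidate sample for SFT or RL training.
    print(out.outputs[0].text)
```
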
Eduardo Alvarez – Introducing NVFP4 for Efficient and Accurate Low-Precision Inference (2025-06-24)
http://www.open-lab.net/blog/?p=102000
To get the most out of AI, optimizations are critical. When developers think about optimizing AI models for inference, model compression techniques, such as quantization, distillation, and pruning, typically come to mind. The most common of the three, without a doubt, is quantization. This is typically due to how well it preserves task-specific accuracy after optimization and its broad choice of supported…

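NVFP4 quantization is exposed through the NVIDIA TensorRT Model Optimizer PTQ API. The sketch below shows the general `mtq.quantize` calibration flow; the config name `NVFP4_DEFAULT_CFG` and the checkpoint are assumptions and may differ by release:

```python
import torch
import modelopt.torch.quantization as mtq  # TensorRT Model Optimizer
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

calib_texts = ["The capital of France is", "Quantization reduces memory by"]

def forward_loop(m):
    # Calibration pass: run a few representative samples through the model.
    for text in calib_texts:
        m(tok(text, return_tensors="pt").input_ids)

# Config name assumed from recent Model Optimizer releases.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```
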
Rita Fernandes Neves – NVIDIA Deep Learning Institute Offers Multilingual AI Training at GTC Paris (2025-05-30)
http://www.open-lab.net/blog/?p=101014
Large language models (LLMs) are capable of recognizing, summarizing, translating, predicting, and generating content. Yet even the most powerful LLMs face limitations when working with specialized business knowledge, niche technical domains, or the diverse linguistic and cultural contexts of global operations. Most models labeled as multilingual, for example, are trained mainly in English…

Yilin Fan – Blackwell Breaks the 1,000 TPS/User Barrier With Meta's Llama 4 Maverick (2025-05-23)
http://www.open-lab.net/blog/?p=100729
NVIDIA has achieved a world-record large language model (LLM) inference speed. A single NVIDIA DGX B200 node with eight NVIDIA Blackwell GPUs can achieve over 1,000 tokens per second (TPS) per user on the 400-billion-parameter Llama 4 Maverick model, the largest and most powerful model available in the Llama 4 collection. This speed was independently measured by the AI benchmarking service…

Ankit Patel – Integrate and Deploy Tongyi Qwen3 Models into Production Applications with NVIDIA (2025-05-02)
http://www.open-lab.net/blog/?p=99462
Alibaba recently released Tongyi Qwen3, a family of open-source hybrid-reasoning large language models (LLMs). The Qwen3 family consists of two MoE models, 235B-A22B (235B total parameters and 22B active parameters) and 30B-A3B, and six dense models in 0.6B, 1.7B, 4B, 8B, 14B, and 32B versions. With ultra-fast token generation, developers can efficiently integrate and deploy Qwen3…

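As a rough sketch of what deployment can look like, here is the high-level LLM API from recent TensorRT-LLM releases pointed at a Qwen3 checkpoint; the model ID is an assumption, not taken from the post:

```python
from tensorrt_llm import LLM, SamplingParams

# Hugging Face model ID assumed; adjust to the Qwen3 variant you deploy.
llm = LLM(model="Qwen/Qwen3-30B-A3B")
params = SamplingParams(temperature=0.6, max_tokens=128)

for out in llm.generate(["Explain mixture-of-experts in two sentences."], params):
    print(out.outputs[0].text)
```
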
Davide Paglieri – Benchmarking Agentic LLM and VLM Reasoning for Gaming with NVIDIA NIM (2025-04-24)
http://www.open-lab.net/blog/?p=99202
This is the first post in the LLM Benchmarking series, which shows how to use GenAI-Perf to benchmark the Meta Llama 3 model when deployed with NVIDIA NIM. Researchers from the University College London (UCL) Deciding, Acting, and Reasoning with Knowledge (DARK) Lab leverage NVIDIA NIM microservices in their new game-based benchmark suite, Benchmarking Agentic LLM and VLM Reasoning On Games…

Ashraf Eassa – NVIDIA Blackwell Delivers Massive Performance Leaps in MLPerf Inference v5.0 (2025-04-02)
http://www.open-lab.net/blog/?p=98367
The compute demands for large language model (LLM) inference are growing rapidly, fueled by the combination of growing model sizes, real-time latency requirements, and, most recently, AI reasoning. At the same time, as AI adoption grows, the ability of an AI factory to serve as many users as possible, all while maintaining good per-user experiences, is key to maximizing the value it generates.

Vinh Nguyen – LLM Inference Benchmarking: Fundamental Concepts (2025-04-02)
http://www.open-lab.net/blog/?p=98215
This is the first post in the large language model latency-throughput benchmarking series, which aims to instruct developers on common metrics used for LLM benchmarking, fundamental concepts, and how to benchmark your LLM applications. The past few years have witnessed the rise in popularity of generative AI and large language models (LLMs), as part of a broad AI revolution.

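To make the metrics concrete, here is a small, self-contained sketch (not from the post) that computes time to first token (TTFT), inter-token latency (ITL), end-to-end latency, and output throughput from per-token arrival timestamps:

```python
def latency_metrics(token_times, request_start):
    """Common LLM serving metrics from per-token arrival times (seconds)."""
    ttft = token_times[0] - request_start            # time to first token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0     # avg inter-token latency
    e2e = token_times[-1] - request_start            # end-to-end latency
    tput = len(token_times) / e2e                    # output tokens per second
    return {"ttft": ttft, "itl": itl, "e2e": e2e, "throughput": tput}

# A request issued at t=0.0 whose four output tokens arrived at these times:
print(latency_metrics([0.25, 0.30, 0.35, 0.41], request_start=0.0))
```
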
Uttara Kumar – Boost Llama Model Performance on Microsoft Azure AI Foundry with NVIDIA TensorRT-LLM (2025-03-20)
http://www.open-lab.net/blog/?p=97008
Microsoft, in collaboration with NVIDIA, announced transformative performance improvements for the Meta Llama family of models on its Azure AI Foundry platform. These advancements, enabled by NVIDIA TensorRT-LLM optimizations, deliver significant gains in throughput, reduced latency, and improved cost efficiency, all while preserving the quality of model outputs. With these improvements…

Ashraf Eassa – NVIDIA Blackwell Delivers World-Record DeepSeek-R1 Inference Performance (2025-03-18)
http://www.open-lab.net/blog/?p=97352
NVIDIA announced world-record DeepSeek-R1 inference performance at NVIDIA GTC 2025. A single NVIDIA DGX system with eight NVIDIA Blackwell GPUs can achieve over 250 tokens per second per user or a maximum throughput of over 30,000 tokens per second on the massive, state-of-the-art 671-billion-parameter DeepSeek-R1 model. These rapid advancements in performance at both ends of the performance…

Sangjune Park – Spotlight: NAVER Place Optimizes SLM-Based Vertical Services with NVIDIA TensorRT-LLM (2025-02-28)
http://www.open-lab.net/blog/?p=96279
As of March 18, 2025, NVIDIA Triton Inference Server is now part of the NVIDIA Dynamo Platform and has been renamed to NVIDIA Dynamo Triton, accordingly. NAVER is a popular South Korean search engine company that offers Naver Place, a geo-based service that provides detailed information about millions of businesses and points of interest across Korea. Users can search for different places…

Anjali Shah – Optimizing Qwen2.5-Coder Throughput with NVIDIA TensorRT-LLM Lookahead Decoding (2025-02-14)
http://www.open-lab.net/blog/?p=96010
Large language models (LLMs) that specialize in coding have been steadily adopted into developer workflows. From pair programming to self-improving AI agents, these models assist developers with various tasks, including enhancing code, fixing bugs, generating tests, and writing documentation. To promote the development of open-source LLMs, the Qwen team recently released Qwen2.5-Coder…

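Lookahead decoding is configured through the TensorRT-LLM LLM API. The sketch below is assumption-heavy: the class name `LookaheadDecodingConfig`, its window/n-gram/verification-set parameters, and the `speculative_config` keyword reflect recent releases but may differ in yours:

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import LookaheadDecodingConfig  # name assumed

# (W, N, G): lookahead window, n-gram size, and verification-set size.
cfg = LookaheadDecodingConfig(max_window_size=4,
                              max_ngram_size=4,
                              max_verification_set_size=4)

llm = LLM(model="Qwen/Qwen2.5-Coder-7B-Instruct",
          speculative_config=cfg)  # keyword name assumed
out = llm.generate(["Write binary search in Python."],
                   SamplingParams(max_tokens=200))
print(out[0].outputs[0].text)
```
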
Cheng-Han (Hank) Du – Improving Translation Quality with Domain-Specific Fine-Tuning and NVIDIA NIM (2025-02-05)
http://www.open-lab.net/blog/?p=95756
Translation plays an essential role in enabling companies to expand across borders, with requirements varying significantly in terms of tone, accuracy, and technical terminology handling. The emergence of sovereign AI has highlighted critical challenges in large language models (LLMs), particularly their struggle to capture nuanced cultural and linguistic contexts beyond English-dominant…

Nick Comly – Optimize AI Inference Performance with NVIDIA Full-Stack Solutions (2025-01-24)
http://www.open-lab.net/blog/?p=95310
As of March 18, 2025, NVIDIA Triton Inference Server is now part of the NVIDIA Dynamo Platform and has been renamed to NVIDIA Dynamo Triton, accordingly. The explosion of AI-driven applications has placed unprecedented demands on both developers, who must balance delivering cutting-edge performance with managing operational complexity and cost, and AI infrastructure.

John Thomson – Introducing New KV Cache Reuse Optimizations in NVIDIA TensorRT-LLM (2025-01-16)
http://www.open-lab.net/blog/?p=95040
Language models generate text by predicting the next token, given all the previous tokens including the input text tokens. Key and value elements of the previous tokens are used as historical context in LLM serving for generation of the next set of tokens. Caching these key and value elements from previous tokens avoids expensive recomputation and effectively leads to higher throughput. However…

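KV cache block reuse is switched on via the LLM API's cache config. A minimal sketch, assuming the `KvCacheConfig` class and `kv_cache_config` keyword from recent TensorRT-LLM releases:

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Identical prompt prefixes (e.g., a shared system prompt) can then reuse
# cached KV blocks instead of being recomputed.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct",
          kv_cache_config=KvCacheConfig(enable_block_reuse=True))

system = "You are a concise assistant.\n\n"
# The second request hits cached KV blocks for the shared prefix.
outs = llm.generate([system + "What is TTFT?", system + "What is a KV cache?"])
```
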
Sama Bali – GPU Memory Essentials for AI Performance (2025-01-15)
http://www.open-lab.net/blog/?p=94979
Generative AI has revolutionized how people bring ideas to life, and agentic AI represents the next leap forward in this technological evolution. By leveraging sophisticated, autonomous reasoning and iterative planning, AI agents can tackle complex, multistep problems with remarkable efficiency. As AI continues to revolutionize industries, the demand for running AI models locally has surged.

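A quick back-of-the-envelope rule related to the post's theme: weight memory scales with parameter count times bits per weight. A minimal sketch (KV cache and activations need additional headroom on top of this):

```python
def weights_gib(params_billion, bits_per_weight):
    """Approximate GPU memory for model weights alone, in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

# An 8B-parameter model at common precisions:
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weights_gib(8, bits):.1f} GiB")
# 16-bit: 14.9 GiB, 8-bit: 7.5 GiB, 4-bit: 3.7 GiB
```
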
Rakib Hasan – NVIDIA TensorRT-LLM Now Supports Recurrent Drafting for Optimizing LLM Inference (2024-12-18)
http://www.open-lab.net/blog/?p=92963
Recurrent drafting (referred to as ReDrafter) is a novel speculative decoding technique developed and open-sourced by Apple for large language model (LLM) inference, now available with NVIDIA TensorRT-LLM. ReDrafter helps developers significantly boost LLM workload performance on NVIDIA GPUs. NVIDIA TensorRT-LLM is a library for optimizing LLM inference. It provides an easy-to-use Python API to…

Anjali Shah – Boost Llama 3.3 70B Inference Throughput 3x with NVIDIA TensorRT-LLM Speculative Decoding (2024-12-17)
http://www.open-lab.net/blog/?p=94146
Meta's Llama collection of open large language models (LLMs) continues to grow with the recent addition of Llama 3.3 70B, a text-only instruction-tuned model. Llama 3.3 provides enhanced performance relative to the older Llama 3.1 70B model and can even match the capabilities of the larger, more computationally expensive Llama 3.1 405B model on several tasks including math, reasoning, and coding…

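The core idea behind draft-target speculative decoding: a cheap draft model proposes several tokens, and the target model verifies them in one pass, keeping the longest accepted prefix. A toy, self-contained sketch of that accept/reject loop (illustrative only; real implementations verify against the target model's logits):

```python
import random
random.seed(0)

VOCAB = ["the", "cat", "sat", "on", "a", "mat"]

def draft_propose(prefix, k):
    # Stand-in for a small, fast draft model proposing k tokens at once.
    return [random.choice(VOCAB) for _ in range(k)]

def target_accepts(prefix, token):
    # Stand-in for the large target model verifying one proposed token.
    return random.random() < 0.7

def speculative_step(prefix, k=4):
    """One draft-and-verify round: keep the longest accepted prefix."""
    accepted = []
    for tok in draft_propose(prefix, k):
        if not target_accepts(prefix + accepted, tok):
            break                      # first rejection ends the round
        accepted.append(tok)
    return accepted                    # often yields >1 token per target pass

print(speculative_step(["the"]))
```
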
Michelle Horton – Top Posts of 2024 Highlight NVIDIA NIM, LLM Breakthroughs, and Data Science Optimization (2024-12-16)
http://www.open-lab.net/blog/?p=93566
2024 was another landmark year for developers, researchers, and innovators working with NVIDIA technologies. From groundbreaking developments in AI inference to empowering open-source contributions, these blog posts highlight the breakthroughs that resonated most with our readers. NVIDIA NIM Offers Optimized Inference Microservices for Deploying AI Models at Scale: Introduced in…

Anjali Shah – NVIDIA TensorRT-LLM Now Accelerates Encoder-Decoder Models with In-Flight Batching (2024-12-11)
http://www.open-lab.net/blog/?p=93516
NVIDIA recently announced that NVIDIA TensorRT-LLM now accelerates encoder-decoder model architectures. TensorRT-LLM is an open-source library that optimizes inference for diverse model architectures. The addition of encoder-decoder model support further expands TensorRT-LLM capabilities, providing highly optimized inference for an even broader range of…

Amr Elmeleegy – Spotlight: Perplexity AI Serves 400 Million Search Queries a Month Using NVIDIA Inference Stack (2024-12-05)
http://www.open-lab.net/blog/?p=93396
As of March 18, 2025, NVIDIA Triton Inference Server is now part of the NVIDIA Dynamo Platform and has been renamed to NVIDIA Dynamo Triton, accordingly. The demand for AI-enabled services continues to grow rapidly, placing increasing pressure on IT and infrastructure teams. These teams are tasked with provisioning the necessary hardware and software to meet that demand while simultaneously…

Carl (Izzy) Putterman – TensorRT-LLM Speculative Decoding Boosts Inference Throughput by up to 3.6x (2024-12-02)
http://www.open-lab.net/blog/?p=92847
NVIDIA TensorRT-LLM support for speculative decoding now provides over 3x the speedup in total token throughput. TensorRT-LLM is an open-source library that provides blazing-fast inference support for numerous popular large language models (LLMs) on NVIDIA GPUs. By adding support for speculative decoding on single GPU and single-node multi-GPU, the library further expands its supported…

Manoj C R – Spotlight: TCS Increases Automotive Software Testing Speeds by 2x Using NVIDIA Generative AI (2024-11-22)
http://www.open-lab.net/blog/?p=92444
Generative AI is transforming every aspect of the automotive industry, including software development, testing, user experience, personalization, and safety. With the automotive industry shifting from a mechanically driven approach to a software-driven one, generative AI is unlocking a world of possibilities. Tata Consultancy Services (TCS) focuses on two major segments for leveraging…

Amr Elmeleegy – NVIDIA TensorRT-LLM Multiblock Attention Boosts Throughput by More Than 3x for Long Sequence Lengths on NVIDIA HGX H200 (2024-11-22)
http://www.open-lab.net/blog/?p=92591
Generative AI models are advancing rapidly. Every generation of models comes with a larger number of parameters and longer context windows. The Llama 2 series of models introduced in July 2023 had a context length of 4K tokens, and the Llama 3.1 models, introduced only a year later, dramatically expanded that to 128K tokens. While long context lengths allow models to perform cognitive tasks…

Ashraf Eassa – Llama 3.2 Full-Stack Optimizations Unlock High Performance on NVIDIA GPUs (2024-11-19)
http://www.open-lab.net/blog/?p=90142
Meta recently released its Llama 3.2 series of vision language models (VLMs), which come in 11B parameter and 90B parameter variants. These models are multimodal, supporting both text and image inputs. In addition, Meta has launched text-only small language model (SLM) variants of Llama 3.2 with 1B and 3B parameters. NVIDIA has optimized the Llama 3.2 collection of models for great performance and…

Bethann Noble – NVIDIA NIM 1.4 Ready to Deploy with 2.4x Faster Inference (2024-11-16)
http://www.open-lab.net/blog/?p=92172
The demand for ready-to-deploy high-performance inference is growing as generative AI reshapes industries. NVIDIA NIM provides production-ready microservice containers for AI model inference, constantly improving enterprise-grade generative AI performance. With the upcoming NIM version 1.4 scheduled for release in early December, request performance is improved by up to 2.4x out-of-the-box with…

Amr Elmeleegy – Streamlining AI Inference Performance and Deployment with NVIDIA TensorRT-LLM Chunked Prefill (2024-11-15)
http://www.open-lab.net/blog/?p=92052
In this blog post, we take a closer look at chunked prefill, a feature of NVIDIA TensorRT-LLM that increases GPU utilization and simplifies the deployment experience for developers. This builds on our previous post discussing how advanced KV cache optimization features in TensorRT-LLM improve performance up to 5x in use cases that require system prefills. When a user submits a request to…

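Conceptually, chunked prefill splits a long prompt into fixed-size pieces so prefill work can be interleaved with ongoing decode steps instead of monopolizing the GPU. A minimal sketch of just the chunking step (the actual scheduling happens inside TensorRT-LLM):

```python
def chunk_prefill_tokens(prompt_tokens, chunk_size):
    """Split a long prompt into fixed-size prefill chunks."""
    return [prompt_tokens[i:i + chunk_size]
            for i in range(0, len(prompt_tokens), chunk_size)]

chunks = chunk_prefill_tokens(list(range(10_000)), chunk_size=2_048)
print(len(chunks), "chunks")  # 5 chunks, each scheduled alongside decode work
```
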
Amit Bleiweiss – Spotlight: Dataloop Accelerates Multimodal Data Preparation Pipelines for LLMs with NVIDIA NIM (2024-11-12)
http://www.open-lab.net/blog/?p=91071
In the rapidly evolving landscape of AI, the preparation of high-quality datasets for large language models (LLMs) has become a critical challenge. It directly affects a model's accuracy, performance, and ability to generate reliable and unbiased outputs across diverse tasks and domains. Thanks to the partnership between NVIDIA and Dataloop, we are addressing this obstacle head-on…

Amr Elmeleegy – 5x Faster Time to First Token with NVIDIA TensorRT-LLM KV Cache Early Reuse (2024-11-08)
http://www.open-lab.net/blog/?p=91625
In our previous blog post, we demonstrated how reusing the key-value (KV) cache by offloading it to CPU memory can accelerate time to first token (TTFT) by up to 14x on x86-based NVIDIA H100 Tensor Core GPUs and 28x on the NVIDIA GH200 Superchip. In this post, we shed light on KV cache reuse techniques and best practices that can drive even further TTFT speedups. LLMs are rapidly…

Anton Korzh – 3x Faster AllReduce with NVSwitch and TensorRT-LLM MultiShot (2024-11-01)
http://www.open-lab.net/blog/?p=91412
Deploying generative AI workloads in production environments, where user numbers can fluctuate from hundreds to hundreds of thousands and where input sequence lengths differ with each request, poses unique challenges. To achieve low-latency inference in these environments, multi-GPU setups are a must, irrespective of the GPU generation or its memory capacity. To enhance inference performance in…

Maggie Zhang – Scaling LLMs with NVIDIA Triton and NVIDIA TensorRT-LLM Using Kubernetes (2024-10-22)
http://www.open-lab.net/blog/?p=90412
As of March 18, 2025, NVIDIA Triton Inference Server is now part of the NVIDIA Dynamo Platform and has been renamed to NVIDIA Dynamo Triton, accordingly. Large language models (LLMs) have been widely used for chatbots, content generation, summarization, classification, translation, and more. State-of-the-art LLMs and foundation models, such as Llama, Gemma, GPT, and Nemotron…

Nick Comly – Boosting Llama 3.1 405B Throughput by Another 1.5x on NVIDIA H200 Tensor Core GPUs and NVLink Switch (2024-10-09)
http://www.open-lab.net/blog/?p=90040
The continued growth of LLM capabilities, fueled by increasing parameter counts and support for longer contexts, has led to their usage in a wide variety of applications, each with diverse deployment requirements. For example, a chatbot supports a small number of users at very low latencies for good interactivity. Meanwhile, synthetic data generation requires high throughput to process many items…

Sharath Sreenivas – Mistral-NeMo-Minitron 8B Model Delivers Unparalleled Accuracy (2024-10-08)
http://www.open-lab.net/blog/?p=87739
This post was originally published August 21, 2024 but has been revised with current data. Recently, NVIDIA and Mistral AI unveiled Mistral NeMo 12B, a leading state-of-the-art large language model (LLM). Mistral NeMo 12B consistently outperforms similarly sized models on a wide range of benchmarks. We announced Mistral-NeMo-Minitron 8B, one of the most advanced open-access models in…

Jen Witsoe – Just Released: NVIDIA TensorRT-LLM 0.13.0 (2024-10-04)
http://www.open-lab.net/blog/?p=89751
Updates include tensor parallel support for Mamba2, sparse mixer normalization for MoE models, and more.

Delilah Liu – Evolving AI-Powered Game Development with Retrieval-Augmented Generation (2024-10-01)
http://www.open-lab.net/blog/?p=89574
Game development is a complex and resource-intensive process, particularly when using advanced tools like Unreal Engine. Developers find themselves navigating through vast amounts of information, often scattered across tutorials, user manuals, API documentation, and the source code itself. This multifaceted journey requires expertise in programming, design, and project management…

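At its core, the RAG pattern here is: embed the documentation, retrieve the best-matching passage for a query, and prepend it to the LLM prompt. A self-contained toy sketch (the letter-count "embedding" is a deliberate stand-in for a real embedding model or retrieval endpoint):

```python
import numpy as np

def embed(text):
    # Toy embedding: normalized letter counts; real systems use a trained
    # embedding model.
    v = np.zeros(26)
    for c in text.lower():
        if c.isalpha():
            v[ord(c) - ord("a")] += 1
    n = np.linalg.norm(v)
    return v / n if n else v

docs = [
    "Use SpawnActor to create an actor at runtime in Unreal Engine.",
    "Blueprints are Unreal Engine's visual scripting system.",
]
doc_vecs = np.stack([embed(d) for d in docs])

query = "How do I spawn an actor?"
scores = doc_vecs @ embed(query)         # cosine similarity of unit vectors
context = docs[int(scores.argmax())]     # retrieved passage
prompt = f"Context: {context}\nQuestion: {query}\nAnswer:"
print(prompt)                            # this grounded prompt goes to the LLM
```
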
Nick Comly – Low Latency Inference Chapter 2: Blackwell is Coming. NVIDIA GH200 NVL32 with NVLink Switch Gives Signs of Big Leap in Time to First Token Performance (2024-09-26)
http://www.open-lab.net/blog/?p=88938
Many of the most exciting applications of large language models (LLMs), such as interactive speech bots, coding co-pilots, and search, need to begin responding to user queries quickly to deliver positive user experiences. The time that it takes for an LLM to ingest a user prompt (and context, which can be sizable) and begin outputting a response is called time to first token (TTFT).

Anjali Shah – Deploying Accelerated Llama 3.2 from the Edge to the Cloud (2024-09-25)
http://www.open-lab.net/blog/?p=89436
Expanding the open-source Meta Llama collection of models, the Llama 3.2 collection includes vision language models (VLMs), small language models (SLMs), and an updated Llama Guard model with support for vision. When paired with the NVIDIA accelerated computing platform, Llama 3.2 offers developers, researchers, and enterprises valuable new capabilities and optimizations to realize their…

Akhiad Bercovich – Advancing the Accuracy-Efficiency Frontier with Llama-3.1-Nemotron-51B (2024-09-23)
http://www.open-lab.net/blog/?p=89283
Today, NVIDIA released a unique language model that delivers unmatched accuracy-efficiency performance. Llama 3.1-Nemotron-51B, derived from Meta's Llama-3.1-70B, uses a novel neural architecture search (NAS) approach that results in a highly accurate and efficient model. The model fits on a single NVIDIA H100 GPU at high workloads, making it much more accessible and affordable.

Amit Bleiweiss – Spotlight: xpander AI Equips NVIDIA NIM Applications with Agentic Tools (2024-09-11)
http://www.open-lab.net/blog/?p=88694
Equipping agentic AI applications with tools will usher in the next phase of AI. By enabling autonomous agents and other AI applications to fetch real-time data, perform actions, and interact with external systems, developers can bridge the gap to new, real-world use cases that significantly enhance productivity and the user experience. xpander AI, a member of the NVIDIA Inception program for…

Jan Lasek – Post-Training Quantization of LLMs with NVIDIA NeMo and NVIDIA TensorRT Model Optimizer (2024-09-10)
http://www.open-lab.net/blog/?p=88489
As large language models (LLMs) are becoming even bigger, it is increasingly important to provide easy-to-use and efficient deployment paths because the cost of serving such LLMs is becoming higher. One way to reduce this cost is to apply post-training quantization (PTQ), which consists of techniques to reduce computational and memory requirements for serving trained models. In this post…

Ashraf Eassa – Low Latency Inference Chapter 1: Up to 1.9x Higher Llama 3.1 Performance with Medusa on NVIDIA HGX H200 with NVLink Switch (2024-09-05)
http://www.open-lab.net/blog/?p=88127
As large language models (LLMs) continue to grow in size and complexity, multi-GPU compute is a must-have to deliver the low latency and high throughput that real-time generative AI applications demand. Performance depends both on the ability for the combined GPUs to process requests as "one mighty GPU" with ultra-fast GPU-to-GPU communication and advanced software able to take full…

Anjali Shah – Boosting Llama 3.1 405B Performance up to 1.44x with NVIDIA TensorRT Model Optimizer on NVIDIA H200 GPUs (2024-08-28)
http://www.open-lab.net/blog/?p=88017
The Llama 3.1 405B large language model (LLM), developed by Meta, is an open-source community model that delivers state-of-the-art performance and supports a variety of use cases. With 405 billion parameters and support for context lengths of up to 128K tokens, Llama 3.1 405B is also one of the most demanding LLMs to run. To deliver both low latency to optimize the user experience and high…

Ashraf Eassa – NVIDIA Blackwell Platform Sets New LLM Inference Records in MLPerf Inference v4.1 (2024-08-28)
http://www.open-lab.net/blog/?p=87957
Large language model (LLM) inference is a full-stack challenge. Powerful GPUs, high-bandwidth GPU-to-GPU interconnects, efficient acceleration libraries, and a highly optimized inference engine are required for high-throughput, low-latency inference. MLPerf Inference v4.1 is the latest version of the popular and widely recognized MLPerf Inference benchmarks, developed by the MLCommons…

Annamalai Chockalingam – Deploy Diverse AI Apps with Multi-LoRA Support on RTX AI PCs and Workstations (2024-08-28)
http://www.open-lab.net/blog/?p=88097
Today's large language models (LLMs) achieve unprecedented results across many use cases. Yet, application developers often need to customize and tune these models to work specifically for their use cases, due to the general nature of foundation models. Full fine-tuning requires a large amount of data and compute infrastructure, resulting in model weights being updated.

Rajvir Singh – Optimizing Inference Efficiency for LLMs at Scale with NVIDIA NIM Microservices (2024-08-14)
http://www.open-lab.net/blog/?p=87091
As large language models (LLMs) continue to evolve at an unprecedented pace, enterprises are looking to build generative AI-powered applications that maximize throughput to lower operational costs and minimize latency to deliver superior user experiences. This post discusses the critical performance metrics of throughput and latency for LLMs, exploring their importance and trade-offs between…

Sam Julien – Writer Releases Domain-Specific LLMs for Healthcare and Finance (2024-08-07)
http://www.open-lab.net/blog/?p=86736
Writer has released two new domain-specific AI models, Palmyra-Med 70B and Palmyra-Fin 70B, expanding the capabilities of NVIDIA NIM. These models bring unparalleled accuracy to medical and financial generative AI applications, outperforming comparable models like GPT-4, Med-PaLM 2, and Claude 3.5 Sonnet. While general-purpose large language models (LLMs) have captured recent headlines…

Asher Fredman – Accelerating Hebrew LLM Performance with NVIDIA TensorRT-LLM (2024-08-06)
http://www.open-lab.net/blog/?p=86776
Developing a high-performing Hebrew large language model (LLM) presents distinct challenges stemming from the rich and complex nature of the Hebrew language itself. The intricate structure of Hebrew, with words formed through root and pattern combinations, demands sophisticated modeling approaches. Moreover, the lack of capitalization and the frequent absence of punctuation like periods and commas…

Chintan Patel – Revolutionizing Code Completion with Codestral Mamba, the Next-Gen Coding LLM (2024-07-25)
http://www.open-lab.net/blog/?p=85101
In the rapidly evolving field of generative AI, coding models have become indispensable tools for developers, enhancing productivity and precision in software development. They provide significant benefits by automating complex tasks, enhancing scalability, and fostering innovation, making them invaluable tools in modern software development. This post explores the benefits of Codestral Mamba…

Tianna Nguy – New Workshops: Customize LLMs, Build and Deploy Large Neural Networks (2024-07-16)
http://www.open-lab.net/blog/?p=85505
Register now for an instructor-led public workshop in July, August or September. Space is limited.

Ashraf Eassa – Achieving High Mixtral 8x7B Performance with NVIDIA H100 Tensor Core GPUs and NVIDIA TensorRT-LLM (2024-07-02)
http://www.open-lab.net/blog/?p=84749
As large language models (LLMs) continue to grow in size and complexity, the performance requirements for serving them quickly and cost-effectively continue to grow. Delivering high LLM inference performance requires an efficient parallel computing architecture and a flexible and highly optimized software stack. Recently, NVIDIA Hopper GPUs running NVIDIA TensorRT-LLM inference software set…

Hannah Simmons – Google's New Gemma 2 Model Now Optimized and Available on NVIDIA API Catalog (2024-07-01)
http://www.open-lab.net/blog/?p=84688
Gemma 2, the next generation of Google Gemma models, is now optimized with TensorRT-LLM and packaged as an NVIDIA NIM inference microservice.

Belen Tegegn – Create RAG Applications Using NVIDIA NIM and Haystack on Kubernetes (2024-06-28)
http://www.open-lab.net/blog/?p=84683
A step-by-step guide to building robust, scalable RAG apps with Haystack and NVIDIA NIMs on Kubernetes.

Amr Elmeleegy – Demystifying AI Inference Deployments for Trillion Parameter Large Language Models (2024-06-12)
http://www.open-lab.net/blog/?p=83013
As of March 18, 2025, NVIDIA Triton Inference Server is now part of the NVIDIA Dynamo Platform and has been renamed to NVIDIA Dynamo Triton, accordingly. AI is transforming every industry, addressing grand human scientific challenges such as precision drug discovery and the development of autonomous vehicles, as well as solving commercial problems such as automating the creation of e-commerce…

Gunjan Mehta – Maximum Performance and Minimum Footprint for AI Apps with NVIDIA TensorRT Weight-Stripped Engines (2024-06-11)
http://www.open-lab.net/blog/?p=83568
NVIDIA TensorRT, an established inference library for data centers, has rapidly emerged as a desirable inference backend for NVIDIA GeForce RTX and NVIDIA RTX GPUs. Now, deploying TensorRT into apps has gotten even easier with prebuilt TensorRT engines. The newly released TensorRT 10.0 with weight-stripped engines offers a unique solution for minimizing the engine shipment size by reducing…

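In TensorRT 10, weight stripping is requested at build time with a builder flag; the engine ships without weights and is refitted from the original checkpoint on the target machine. A hedged sketch of the build side (the `STRIP_PLAN` flag name is taken from TensorRT 10 docs; `model.onnx` is a placeholder):

```python
import tensorrt as trt  # TensorRT 10+

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:      # placeholder model path
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.STRIP_PLAN)  # build a weight-stripped engine

plan = builder.build_serialized_network(network, config)
with open("model_stripped.plan", "wb") as f:
    f.write(plan)  # much smaller artifact; weights are refitted at deploy time
```
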
Jig Bhadaliya – NVIDIA Collaborates with Hugging Face to Simplify Generative AI Model Deployments (2024-06-03)
http://www.open-lab.net/blog/?p=83346
As generative AI experiences rapid growth, the community has stepped up to foster this expansion in two significant ways: swiftly publishing state-of-the-art foundational models, and streamlining their integration into application development and production. NVIDIA is aiding this effort by optimizing foundation models to enhance performance, allowing enterprises to generate tokens faster…

Nisanur Genc – Personalized Learning with Gipi, NVIDIA TensorRT-LLM, and AI Foundation Models (2024-05-30)
http://www.open-lab.net/blog/?p=82913
Over 1.2B people are actively learning new languages, with over 500M learners on digital learning platforms such as Duolingo. At the same time, a significant portion of the global population, including 73% of Gen Z, experiences feelings of disconnection and unhappiness, often exacerbated by social media. This highlights a unique dichotomy: people are hungry for personalized learning…

Erin Ho – Accelerate Generative AI Inference Performance with NVIDIA TensorRT Model Optimizer, Now Publicly Available (2024-05-08)
http://www.open-lab.net/blog/?p=81860
In the fast-evolving landscape of generative AI, the demand for accelerated inference speed remains a pressing concern. With the exponential growth in model size and complexity, the need to swiftly produce results to serve numerous users simultaneously continues to grow. The NVIDIA platform stands at the forefront of this endeavor, delivering perpetual performance leaps through innovations across…

Anjali Shah – Turbocharging Meta Llama 3 Performance with NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server (2024-04-28)
http://www.open-lab.net/blog/?p=81223
We're excited to announce support for the Meta Llama 3 family of models in NVIDIA TensorRT-LLM, accelerating and optimizing your LLM inference performance. You can immediately try Llama 3 8B and Llama 3 70B, the first models in the series, through a browser user interface, or through API endpoints running on a fully accelerated NVIDIA stack from the NVIDIA API catalog, where Llama 3 is packaged as…

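The API-endpoint path mentioned above sits behind an OpenAI-compatible interface. A minimal sketch of calling Llama 3 70B through the NVIDIA API catalog (base URL and model ID as listed in the catalog; supply your own API key):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="NVIDIA_API_KEY",  # replace with your key
)
resp = client.chat.completions.create(
    model="meta/llama3-70b-instruct",
    messages=[{"role": "user",
               "content": "Summarize TensorRT-LLM in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```
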
Chintan Patel – Mistral Large and Mixtral 8x22B LLMs Now Powered by NVIDIA NIM and NVIDIA API (2024-04-22)
http://www.open-lab.net/blog/?p=80850
This week's model release features two new NVIDIA AI Foundation models, Mistral Large and Mixtral 8x22B, both developed by Mistral AI. These cutting-edge text-generation AI models are supported by NVIDIA NIM microservices, which provide prebuilt containers powered by NVIDIA inference software that enable developers to reduce deployment times from weeks to minutes. Both models are available through…

Amit Bleiweiss – Tune and Deploy LoRA LLMs with NVIDIA TensorRT-LLM (2024-04-02)
http://www.open-lab.net/blog/?p=80481
Large language models (LLMs) have revolutionized natural language processing (NLP) with their ability to learn from massive amounts of text and generate fluent and coherent texts for various tasks and domains. However, customizing LLMs is a challenging task, often requiring a full training process that is time-consuming and computationally expensive. Moreover, training LLMs requires a diverse and…

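The post's premise is that LoRA sidesteps full fine-tuning by training small low-rank adapter matrices. A minimal training-side sketch with Hugging Face PEFT (the checkpoint and target module names are illustrative; the tuned adapters can then be deployed with TensorRT-LLM):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only a tiny fraction of weights train
```
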
Ashraf Eassa – NVIDIA H200 Tensor Core GPUs and NVIDIA TensorRT-LLM Set MLPerf LLM Inference Records (2024-03-27)
http://www.open-lab.net/blog/?p=80197
Generative AI is unlocking new computing applications that greatly augment human capability, enabled by continued model innovation. Generative AI models, including large language models (LLMs), are used for crafting marketing copy, writing computer code, rendering detailed images, composing music, generating videos, and more. The amount of compute required by the latest models is immense and…

Anjali Shah – NVIDIA TensorRT-LLM Revs Up Inference for Google Gemma (2024-02-21)
http://www.open-lab.net/blog/?p=78037
NVIDIA is collaborating as a launch partner with Google in delivering Gemma, a newly optimized family of open models built from the same research and technology used to create the Gemini models. An optimized release with TensorRT-LLM enables users to develop with LLMs using only a desktop with an NVIDIA RTX GPU. Created by Google DeepMind, Gemma 2B and Gemma 7B, the first models in the series…

Chintan Patel – Generate Code, Answer Queries, and Translate Text with New NVIDIA AI Foundation Models (2024-02-05)
http://www.open-lab.net/blog/?p=77364
This week's Model Monday release features the NVIDIA-optimized Code Llama, Kosmos-2, and SeamlessM4T, which you can experience directly from your browser. With NVIDIA AI Foundation Models and Endpoints, you can access a curated set of community and NVIDIA-built generative AI models to experience, customize, and deploy in enterprise applications. Meta's Code Llama 70B is the latest…

Amit Bleiweiss – Deploy an AI Coding Assistant with NVIDIA TensorRT-LLM and NVIDIA Triton (2024-02-01)
http://www.open-lab.net/blog/?p=77200
Large language models (LLMs) have revolutionized the field of AI, creating entirely new ways of interacting with the digital world. While they provide a good generalized solution, they often must be tuned to support specific domains and tasks. AI coding assistants, or code LLMs, have emerged as one domain to help accomplish this. By 2025, 80% of the product development lifecycle will make…

Jesse Clayton – Get Started with Generative AI Development for Windows PCs with NVIDIA RTX (2024-01-08)
http://www.open-lab.net/blog/?p=76227
Generative AI and large language models (LLMs) are changing human-computer interaction as we know it. Many use cases would benefit from running LLMs locally on Windows PCs, including gaming, creativity, productivity, and developer experiences. This post discusses several NVIDIA end-to-end developer tools for creating and deploying both text-based and visual LLM applications on NVIDIA RTX AI-ready…

Michelle Horton – Most Popular NVIDIA Technical Blog Posts of 2023: Generative AI, LLMs, Robotics, and Virtual Worlds Breakthroughs (2023-12-19)
http://www.open-lab.net/blog/?p=74885
As we approach the end of another exciting year at NVIDIA, it's time to look back at the most popular stories from the NVIDIA Technical Blog in 2023. Groundbreaking research and developments in fields such as generative AI, large language models (LLMs), high-performance computing (HPC), and robotics are leading the way in transformative AI solutions and capturing the interest of our readers.