NVIDIA announced world-record DeepSeek-R1 inference performance at NVIDIA GTC 2025. A single NVIDIA DGX system with eight NVIDIA Blackwell GPUs can achieve over 250 tokens per second per user or a maximum throughput of over 30,000 tokens per second on the massive, state-of-the-art 671 billion parameter DeepSeek-R1 model. These rapid advancements in performance at both ends of the performance…
]]>Large language models (LLMs) that specialize in coding have been steadily adopted into developer workflows. From pair programming to self-improving AI agents, these models assist developers with various tasks, including enhancing code, fixing bugs, generating tests, and writing documentation. To promote the development of open-source LLMs, the Qwen team recently released Qwen2.5-Coder…
]]>Language models generate text by predicting the next token, given all the previous tokens including the input text tokens. Key and value elements of the previous tokens are used as historical context in LLM serving for generation of the next set of tokens. Caching these key and value elements from previous tokens avoids expensive recomputation and effectively leads to higher throughput. However…
]]>Meta’s Llama collection of open large language models (LLMs) continues to grow with the recent addition of Llama 3.3 70B, a text-only instruction-tuned model. Llama 3.3 provides enhanced performance respective to the older Llama 3.1 70B model and can even match the capabilities of the larger, more computationally expensive Llama 3.1 405B model on several tasks including math, reasoning, coding…
]]>NVIDIA recently announced that NVIDIA TensorRT-LLM now accelerates encoder-decoder model architectures. TensorRT-LLM is an open-source library that optimizes inference for diverse model architectures, including the following: The addition of encoder-decoder model support further expands TensorRT-LLM capabilities, providing highly optimized inference for an even broader range of…
]]>NVIDIA TensorRT-LLM support for speculative decoding now provides over 3x the speedup in total token throughput. TensorRT-LLM is an open-source library that provides blazing-fast inference support for numerous popular large language models (LLMs) on NVIDIA GPUs. By adding support for speculative decoding on single GPU and single-node multi-GPU, the library further expands its supported…
]]>Meta recently released its Llama 3.2 series of vision language models (VLMs), which come in 11B parameter and 90B parameter variants. These models are multimodal, supporting both text and image inputs. In addition, Meta has launched text-only small language model (SLM) variants of Llama 3.2 with 1B and 3B parameters. NVIDIA has optimized the Llama 3.2 collection of models for great performance and…
]]>Expanding the open-source Meta Llama collection of models, the Llama 3.2 collection includes vision language models (VLMs), small language models (SLMs), and an updated Llama Guard model with support for vision. When paired with the NVIDIA accelerated computing platform, Llama 3.2 offers developers, researchers, and enterprises valuable new capabilities and optimizations to realize their…
]]>The Llama 3.1 405B large language model (LLM), developed by Meta, is an open-source community model that delivers state-of-the-art performance and supports a variety of use cases. With 405 billion parameters and support for context lengths of up to 128K tokens, Llama 3.1 405B is also one of the most demanding LLMs to run. To deliver both low latency to optimize the user experience and high…
]]>AI21 Labs has unveiled their latest and most advanced Jamba 1.5 model family, a cutting-edge collection of large language models (LLMs) designed to excel in a wide array of generative AI tasks. These models are capable of creating content, summarizing and comparing documents, and extracting valuable insights from vast datasets. This mixture of experts (MoE) model takes advantage of the…
]]>NVIDIA collaborated with Mistral to co-build the next-generation language model that achieves leading performance across benchmarks in its class. With a growing number of language models purpose-built for select tasks, NVIDIA Research and Mistral AI combined forces to offer a versatile, open language model that’s performant and runs on a single GPU, such as NVIDIA A100 or H100 GPUs.
]]>In the rapidly evolving field of generative AI, coding models have become indispensable tools for developers, enhancing productivity and precision in software development. They provide significant benefits by automating complex tasks, enhancing scalability, and fostering innovation, making them invaluable tools in modern software development. This post explores the benefits of Codestral Mamba…
]]>Meta’s Llama collection of large language models are the most popular foundation models in the open-source community today, supporting a variety of use cases. Millions of developers worldwide are building derivative models, and are integrating these into their applications. With Llama 3.1, Meta is launching a suite of large language models (LLMs) as well as a suite of trust and safety models…
]]>We’re excited to announce support for the Meta Llama 3 family of models in NVIDIA TensorRT-LLM, accelerating and optimizing your LLM inference performance. You can immediately try Llama 3 8B and Llama 3 70B—the first models in the series—through a browser user interface. Or, through API endpoints running on a fully accelerated NVIDIA stack from the NVIDIA API catalog, where Llama 3 is packaged as…
]]>As of March 18, 2025, NVIDIA Triton Inference Server is now part of the NVIDIA Dynamo Platform and has been renamed to NVIDIA Dynamo Triton, accordingly. Diffusion models are transforming creative workflows across industries. These models generate stunning images based on simple text or image inputs by iteratively shaping random noise into AI-generated art through denoising diffusion…
]]>NVIDIA is collaborating as a launch partner with Google in delivering Gemma, a newly optimized family of open models built from the same research and technology used to create the Gemini models. An optimized release with TensorRT-LLM enables users to develop with LLMs using only a desktop with an NVIDIA RTX GPU. Created by Google DeepMind, Gemma 2B and Gemma 7B—the first models in the series…
]]>Large language models (LLMs) are a class of generative AI models built using transformer networks that can recognize, summarize, translate, predict, and generate language using very large datasets. LLMs have the promise of transforming society as we know it, yet training these foundation models is incredibly challenging. This blog articulates the basic principles behind LLMs…
]]>Large language models (LLMs) are becoming an integral tool for businesses to improve their operations, customer interactions, and decision-making processes. However, off-the-shelf LLMs often fall short in meeting the specific needs of enterprises due to industry-specific terminology, domain expertise, or unique requirements. This is where custom LLMs come into play.
]]>