A typical recipe for improving LLMs involves multiple stages: synthetic data generation (SDG), model training through supervised fine-tuning (SFT) or reinforcement learning (RL), and model evaluation. Each stage requires using different libraries, which are often challenging to set up and difficult to use together. For example, you might use NVIDIA TensorRT-LLM or vLLM for SDG and NVIDIA…
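As a minimal sketch of the SDG stage (assuming vLLM is installed; the model name and prompts below are illustrative placeholders, not the ones used in any particular pipeline), generation might look like this:

```python
# Minimal synthetic data generation (SDG) sketch with vLLM.
# The model name and prompts are illustrative placeholders.
from vllm import LLM, SamplingParams

prompts = [
    "Write a short math word problem about train schedules.",
    "Write a short Python question about list comprehensions.",
]
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
outputs = llm.generate(prompts, sampling)

for out in outputs:
    # Each result carries the prompt and one or more sampled completions.
    print(out.prompt, "->", out.outputs[0].text)
```

The generated samples would then feed the SFT or RL stage, typically after filtering and formatting.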
NVIDIA Run:ai and Amazon Web Services have introduced an integration that lets developers seamlessly scale and manage complex AI training workloads. Combining Amazon SageMaker HyperPod and Run:ai's advanced AI workload and GPU orchestration platform improves efficiency and flexibility. Amazon SageMaker HyperPod provides a fully resilient, persistent cluster that's purpose-built for large-scale…
LMArena at the University of California, Berkeley is making it easier to see which large language models excel at specific tasks, thanks to help from NVIDIA and Nebius. Its rankings, powered by the Prompt-to-Leaderboard (P2L) model, collect votes from humans on which AI performs best in areas such as math, coding, or creative writing. "We capture user preferences across tasks and apply…
The evolution of large language models (LLMs) has been marked by significant advancements in their ability to process and generate text. Among these developments, the concept of context length (the number of tokens in a single input sample that a model can handle) has emerged as a critical factor defining what these models can achieve across diverse applications. For instance…
Training AI models on massive GPU clusters presents significant challenges for model builders. Because manual intervention becomes impractical as job scale increases, automation is critical to maintaining high GPU utilization and training productivity. An exceptional training experience requires resilient systems that provide low-latency error attribution and automatic failover based on root…
Matrix multiplication and attention mechanisms are the computational backbone of modern AI workloads. While libraries like NVIDIA cuDNN provide highly optimized implementations, and frameworks such as CUTLASS offer deep customization, many developers and researchers need a middle ground that combines performance with programmability. The open-source Triton compiler on the NVIDIA Blackwell…
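As a rough illustration of that middle ground, here is a generic tiled matrix-multiplication kernel written in Triton (a sketch only: it does not show any Blackwell-specific features, and the block sizes are arbitrary choices):

```python
# Generic Triton tiled matmul sketch: C = A @ B for row-major fp16 inputs.
import torch
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        offs_k = k + tl.arange(0, BLOCK_K)
        # Row-major addressing: a[m, k] = a_ptr + m * K + k, b[k, n] = b_ptr + k * N + n
        a = tl.load(a_ptr + offs_m[:, None] * K + offs_k[None, :],
                    mask=(offs_m[:, None] < M) & (offs_k[None, :] < K), other=0.0)
        b = tl.load(b_ptr + offs_k[:, None] * N + offs_n[None, :],
                    mask=(offs_k[:, None] < K) & (offs_n[None, :] < N), other=0.0)
        acc += tl.dot(a, b)
    c_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], acc, mask=c_mask)

# Launch over a 2D grid of output tiles.
M = N = K = 512
a = torch.randn((M, K), device="cuda", dtype=torch.float16)
b = torch.randn((K, N), device="cuda", dtype=torch.float16)
c = torch.empty((M, N), device="cuda", dtype=torch.float32)
grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
matmul_kernel[grid](a, b, c, M, N, K, BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
```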
Transformers, with their attention-based architecture, have become the dominant choice for language models (LMs) due to their strong performance, parallelization capabilities, and long-term recall through key-value (KV) caches. However, their quadratic computational cost and high memory demands pose efficiency challenges. In contrast, state space models (SSMs) like Mamba and Mamba-2 offer constant…
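To make the contrast concrete, here is a back-of-the-envelope sketch (the layer counts and dimensions are illustrative assumptions, not figures from the post): a transformer's KV cache grows linearly with context length, while an SSM carries a fixed-size recurrent state.

```python
# Illustrative memory comparison: growing KV cache vs. fixed SSM state.
def kv_cache_bytes(seq_len, layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2):
    # Factor of 2 accounts for storing both keys and values.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

def ssm_state_bytes(layers=32, d_model=4096, d_state=16, bytes_per_elem=2):
    # Fixed-size state per layer, independent of sequence length.
    return layers * d_model * d_state * bytes_per_elem

for n in (4_096, 32_768, 131_072):
    print(f"seq={n:>7}: KV cache ~{kv_cache_bytes(n)/2**30:.1f} GiB, "
          f"SSM state ~{ssm_state_bytes()/2**20:.1f} MiB")
```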
As models grow larger and are trained on more data, they become more capable, making them more useful. To train these models quickly, more performance, delivered at data center scale, is required. The NVIDIA Blackwell platform, launched at GTC 2024 and now in full production, integrates multiple types of chips, including GPU, CPU, DPU, NVLink Switch chip, InfiniBand Switch, and Ethernet Switch.
NVIDIA has announced the latest v0.15 release of NVIDIA TensorRT Model Optimizer, a state-of-the-art toolkit of model optimization techniques including quantization, sparsity, and pruning. These techniques reduce model complexity and enable downstream inference frameworks like NVIDIA TensorRT-LLM and NVIDIA TensorRT to more efficiently optimize the inference speed of generative AI…
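A hedged sketch of how post-training quantization with Model Optimizer can look (the config name, calibration hook, and dataloader are assumptions; check the Model Optimizer documentation for the exact API in the installed release):

```python
# Post-training quantization sketch with TensorRT Model Optimizer (nvidia-modelopt).
# Config name and calibration pattern are assumptions and may differ by version.
import modelopt.torch.quantization as mtq

def forward_loop(model):
    # Run a small calibration set through the model so activation ranges
    # can be collected before quantization parameters are fixed.
    for batch in calib_dataloader:  # assumed to be defined elsewhere
        model(batch)

# Quantize an existing PyTorch model in place (FP8 config name assumed).
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```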
Today's large language models (LLMs) are based on the transformer model architecture introduced in 2017. Since then, rapid advances in AI compute performance have enabled the creation of even larger transformer-based LLMs, dramatically improving their capabilities. Advanced transformer-based LLMs are enabling many exciting applications such as intelligent chatbots, computer code generation…
Generative AI, the ability of algorithms to process various types of inputs (such as text, images, audio, video, and code) and generate new content, is advancing at an unprecedented rate. While this technology is making significant strides across multiple industries, one sector that stands to benefit immensely is the Architecture, Engineering, and Construction (AEC) industry.
Generative AI models have a variety of uses, such as helping write computer code, crafting stories, composing music, generating images, producing videos, and more. And, as these models continue to grow in size and are trained on even more data, they are producing even higher-quality outputs. Building and deploying these more intelligent models is incredibly compute-intensive…
Note: As of January 6, 2025, VILA is now part of the Cosmos Nemotron VLM family. NVIDIA is proud to announce the release of NVIDIA Cosmos Nemotron, a family of state-of-the-art vision language models (VLMs) designed to query and summarize images and videos from physical or virtual environments. Cosmos Nemotron builds upon NVIDIA's groundbreaking visual understanding research including VILA…
GPUs were initially specialized for rendering 3D graphics in video games, primarily to accelerate linear algebra calculations. Today, GPUs have become one of the critical components of the AI revolution. We now rely on these workhorses to fulfill deep learning workloads, crunching through massive and complex semi-structured datasets. However, as demand for AI-based solutions has…
After exploring the fundamentals of diffusion model sampling, parameterization, and training as explained in Generative AI Research Spotlight: Demystifying Diffusion-Based Models, our team began investigating the internals of these network architectures. This turned out to be a frustrating exercise. Any direct attempt to improve these models tended to worsen the results. They seemed to be in…
With Internet-scale data, the computational demands of AI-generated content have grown significantly, with data centers running full steam for weeks or months to train a single model, not to mention the high inference costs in generation, often offered as a service. In this context, suboptimal algorithmic design that sacrifices performance is an expensive mistake. Much of the recent progress…
At AWS re:Invent 2023, AWS and NVIDIA announced that AWS will be the first cloud provider to offer NVIDIA GH200 Grace Hopper Superchips interconnected with NVIDIA NVLink technology through NVIDIA DGX Cloud and running on Amazon Elastic Compute Cloud (Amazon EC2). This is a game-changing technology for cloud computing. The NVIDIA GH200 NVL32, a rack-scale solution within NVIDIA DGX Cloud or an…
Large language models (LLMs) are a class of generative AI models built using transformer networks that can recognize, summarize, translate, predict, and generate language using very large datasets. LLMs have the promise of transforming society as we know it, yet training these foundation models is incredibly challenging. This blog articulates the basic principles behind LLMs…