In this blog post, we'll break down the main FP8 scaling strategies: per-tensor scaling, delayed and current scaling, and per-block scaling (including the Blackwell-backed MXFP8 format). We'll explain why each is essential for maintaining numerical stability and accuracy during low-precision training. Understanding these approaches will help you choose the right recipe for your own FP8 workflows.
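To make the per-tensor recipes concrete, here is a minimal sketch of current scaling and delayed scaling in PyTorch. It assumes PyTorch 2.1 or newer for the torch.float8_e4m3fn dtype; the function names (quantize_current_scaling, quantize_delayed_scaling) are illustrative and are not the Transformer Engine API.

```python
# Sketch only: per-tensor FP8 scaling, assuming PyTorch >= 2.1 for float8 dtypes.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448 for the E4M3 format

def quantize_current_scaling(x: torch.Tensor):
    """Current scaling: derive the scale from this tensor's own amax."""
    amax = x.abs().max().clamp(min=1e-12)
    scale = FP8_MAX / amax                       # map amax onto the FP8 range
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)  # quantize
    return x_fp8, scale                          # keep the scale to dequantize later

def quantize_delayed_scaling(x: torch.Tensor, amax_history: list, window: int = 16):
    """Delayed scaling: reuse a scale derived from recent iterations' amax values."""
    amax_history.append(x.abs().max().item())
    del amax_history[:-window]                   # keep a bounded amax history
    scale = FP8_MAX / max(max(amax_history), 1e-12)
    return (x * scale).to(torch.float8_e4m3fn), scale

# Round-trip a tensor and inspect the quantization error.
x = torch.randn(1024, 1024)
x_fp8, scale = quantize_current_scaling(x)
x_dequant = x_fp8.to(torch.float32) / scale
print("max abs error:", (x - x_dequant).abs().max().item())
```

Per-block recipes such as MXFP8 follow the same idea, but compute one scale per small block of elements rather than one per tensor, which limits how far a single outlier can distort the scale.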