
    Per-Tensor and Per-Block Scaling Strategies for Effective FP8 Training


    In this blog post, we’ll break down the main FP8 scaling strategies—per-tensor scaling, delayed and current scaling, and per-block scaling (including the Blackwell-backed MXFP8 format)—and explain why each is essential for maintaining numerical stability and accuracy during low-precision training. Understanding these approaches will help with choosing the right recipe for your own FP8 workflows.

    This post explores the practical realities of FP8 training, focusing on NVIDIA Nemotron experiments. We’ll examine why per-tensor and delayed scaling are necessary, where they fall short, and how advanced recipes, such as per-block sub-channel scaling, can unlock stable, efficient FP8 training for large models.

    In a previous post, we introduced FP8 precision and compared its two primary formats, E4M3 and E5M2, to more established types like BF16.

    Per-tensor scaling

    Per-tensor scaling is an essential FP8 strategy that assigns a unique scaling factor to each tensor—such as weights, activations, or gradients—instead of using a single global scale. This is crucial because FP8’s narrow dynamic range can’t handle the diverse value distributions found across different tensors. By tailoring the scaling factor to each tensor’s statistics, per-tensor scaling helps prevent numerical instability and ensures more accurate, stable training.

    Delayed scaling and current scaling are two common approaches within per-tensor scaling. Each offers a different strategy for setting scaling factors, but both operate at the individual tensor level to address FP8’s unique requirements.
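To make the mechanics concrete, here is a minimal NumPy sketch of per-tensor scaling. It stores a decode scale that maps each tensor's amax onto the E4M3 limit (448) and simulates the FP8 cast by clamping; the helper names are illustrative and this is not the Transformer Engine implementation.

```python
# Minimal sketch of per-tensor FP8 scaling (illustrative, not the
# Transformer Engine implementation). The stored value is a decode scale:
# divide by it when quantizing, multiply by it when dequantizing. The FP8
# cast itself is simulated by clamping to the E4M3 limit.
import numpy as np

E4M3_MAX = 448.0  # largest finite value in FP8 E4M3

def per_tensor_quantize(x: np.ndarray):
    """Map the tensor's largest magnitude (amax) onto the FP8 E4M3 limit."""
    amax = float(np.abs(x).max())
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    x_fp8 = np.clip(x / scale, -E4M3_MAX, E4M3_MAX)  # real kernels cast to FP8 here
    return x_fp8, scale

def per_tensor_dequantize(x_fp8: np.ndarray, scale: float) -> np.ndarray:
    """Recover the original dynamic range from the FP8 values and the scale."""
    return x_fp8 * scale
```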

    Per-tensor delayed scaling

    Delayed scaling is a practical FP8 scaling strategy that computes the scaling factor for each tensor based on the maximum absolute values (amax) observed over a window of previous iterations, rather than just the current batch. This history-based approach smooths out outliers and reduces the risk of abrupt scaling changes that could destabilize training.

    Instead of deciding the scaling factor based on the current batch of data, delayed scaling looks at the largest values (called “amax”) from several recent batches and keeps a running history of these maximums. 

    When it’s time to choose a new scaling factor, the system reviews this history and picks a value that reflects the recent trends, rather than just the most recent batch. This approach helps smooth out sudden spikes or drops in data values, making the training process more stable and reducing the risk of abrupt changes that could hurt model performance. 

    Figure 1. Delayed scaling strategy: the FP8 operator uses a scaling factor obtained from the history of amaxes (maximums of absolute values) seen in some number of previous iterations, and produces both the FP8 output and the current amax, which is then stored in the history
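A hedged sketch of the idea in Figure 1: the scale used for the current cast comes from a rolling history of previously observed amaxes, and the current amax is recorded afterwards. The class name, history length, and the max reduction over the history are assumptions for this example, not the Transformer Engine implementation.

```python
# Illustrative sketch of delayed scaling: the scale for the current cast is
# derived from a rolling history of previously observed amaxes, and the
# current amax is appended afterwards (compare Figure 1).
import numpy as np
from collections import deque

E4M3_MAX = 448.0

class DelayedScaler:
    def __init__(self, history_len: int = 1024):
        self.amax_history = deque(maxlen=history_len)
        self.scale = 1.0  # decode scale used for the next cast

    def quantize(self, x: np.ndarray):
        if self.amax_history:  # scale comes from history, not the current tensor
            hist_amax = max(self.amax_history)
            self.scale = hist_amax / E4M3_MAX if hist_amax > 0 else 1.0
        x_fp8 = np.clip(x / self.scale, -E4M3_MAX, E4M3_MAX)  # simulated FP8 cast
        self.amax_history.append(float(np.abs(x).max()))      # record current amax
        return x_fp8, self.scale
```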

    While delayed scaling is practical for FP8 training, its effectiveness hinges on a critical assumption: that the statistical distribution of tensor values remains relatively stable across training iterations. However, outliers can dominate the amax history, causing suboptimal scaling that leads to underflow or overflow and potentially destabilizes training, particularly in large-scale runs.

    Per-tensor current scaling

    While delayed scaling uses historical data, per-tensor current scaling offers an alternative that prioritizes real-time adaptability. This method determines the scaling factor for each tensor based on the statistical properties of that tensor in the current forward or backward pass.

    Per-tensor current scaling dynamically adjusts the scale based on the present data range. This immediate responsiveness helps optimize the FP8 representation and has been observed to improve model convergence during training.

    It works by measuring the largest absolute value (amax) in each tensor—whether it’s an activation, weight, or gradient—during every training batch. This measurement is taken in real time, without relying on any history from previous batches. 

    The scaling factor is then set so that this amax fits within the FP8 range, where each tensor uses the full available precision of FP8 for that specific step. 

    With this scaling factor, the tensor’s floating-point values are quantized to FP8, enabling both efficient and accurate training at every iteration. The benefits of current scaling are as follows:

    • Instantaneous adaptation: Current scaling adapts to real-time changes in data distribution, learning rate schedules, or model architectures. This live responsiveness ensures no lag or mismatch in scaling.
    • Simplicity and efficiency: The absence of complex historical buffers or tracking mechanisms simplifies implementation and reduces both computational and memory overhead. Resources remain focused directly on model training.
    • Robustness: Real-time adjustments make current scaling more robust to outliers within the current batch. Since the scale isn’t influenced by past, stale, or erroneous amax values, the result is a more stable training experience for demanding deep learning pipelines.
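
For contrast with delayed scaling, here is a minimal sketch of current scaling: there is no history buffer, and the decode scale is recomputed from the amax of the tensor being cast in the current step. As before, the FP8 cast is simulated by clamping to the E4M3 limit and the names are illustrative.

```python
# Illustrative sketch of per-tensor current scaling: no history buffer, the
# decode scale is recomputed from the amax of the tensor being cast in this
# step, so distribution shifts are reflected immediately.
import numpy as np

E4M3_MAX = 448.0

def current_scaling_cast(x: np.ndarray):
    amax = float(np.abs(x).max())  # measured now, for this specific tensor
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    x_fp8 = np.clip(x / scale, -E4M3_MAX, E4M3_MAX)  # simulated FP8 cast
    return x_fp8, scale

# Called fresh for activations, weights, and gradients at every iteration.
activations = (np.random.randn(16, 64) * 50.0).astype(np.float32)
act_fp8, act_scale = current_scaling_cast(activations)
```
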
    Figure 2. Massive Multitask Language Understanding (MMLU) and average reasoning evaluation of Nemotron5 8B trained with BF16 compared to FP8 current scaling (CS)

    Figure 2 highlights how well the lower-precision format holds up against the BF16 baseline on challenging benchmarks such as average reasoning. 

    While BF16 maintains a slight edge in total accuracy across the token range, current scaling tracks closely behind, with the gap remaining within a narrow percentage margin.

    Per-block scaling

    Figure 3. Difference between per-tensor and per-block scaling factors

    While per-tensor scaling methods offer a foundational approach to FP8 training, they often face limitations when dealing with block-level variability within a single tensor. 

    This challenge arises when different regions or slices within a tensor exhibit different numerical ranges and statistical behaviors. For example, in large transformer models, distinct attention heads within a single weight matrix can possess different magnitudes. Similarly, activations from advanced layers like SwiGLU or mixture-of-experts (MoE) routers are notorious for producing a heterogeneous mix of dense clusters of small values alongside rare but extreme outliers.

    When a single scaling factor is applied to a diverse tensor, either precision is sacrificed in the dense, lower-magnitude regions, or extreme outliers risk being clipped. Both can degrade training stability and the ultimate quality of the model.

    Per-block scaling is a powerful solution that divides each tensor into smaller, more manageable, contiguous blocks and assigns a dedicated scaling factor to each one. 

    The scaling mechanism can adapt to the local statistics of each block, rather than being dictated by the single most extreme value. This means that high-magnitude regions are accurately represented without compromising the fidelity of smaller, more typical values residing within the same tensor.
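A small, hedged illustration of the problem per-block scaling solves: a single outlier dictates the per-tensor scale and pushes typical values below what FP8 E4M3 can represent, while per-block scales adapt to each block's local range. The block size of 32 and the tensor contents are arbitrary choices for this example.

```python
# Illustrative comparison: one outlier dominates the per-tensor scale and
# pushes typical values below the smallest FP8 E4M3 subnormal, while
# per-block decode scales keep outlier-free blocks on the full FP8 range.
import numpy as np

E4M3_MAX = 448.0       # largest finite E4M3 value
E4M3_TINY = 2.0 ** -9  # smallest E4M3 subnormal (~0.00195)

x = np.full(128, 0.01, dtype=np.float32)
x[0] = 10_000.0  # a single extreme outlier

# Per-tensor: one decode scale, dictated by the outlier.
t_scale = np.abs(x).max() / E4M3_MAX
small_scaled = 0.01 / t_scale
print(small_scaled, small_scaled < E4M3_TINY)  # ~4.5e-4, True: would flush to zero

# Per-block: each 32-element block gets its own decode scale.
b_scales = np.abs(x.reshape(-1, 32)).max(axis=1) / E4M3_MAX
print(0.01 / b_scales[1])  # 448.0: outlier-free blocks use the full FP8 range
```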

    As deep learning models continue to grow in size and architectural complexity, fine-grained, adaptive scaling becomes essential for preserving numerical stability and for unlocking the efficiency and speed that FP8 offers on modern hardware. 

    Next, we explore how this advanced scaling concept is implemented in practice, with leading recipes such as the hardware-native MXFP8 and other configurable block scaling approaches.

    What is Micro-Scaling FP8?

    Building upon the concept of per-block scaling, Micro-Scaling FP8 (MXFP8) represents the NVIDIA Blackwell hardware-level solution for achieving efficient and stable FP8 training. 

    MXFP8 aligns with the broader MX data format standard, which defines a generalized framework for shared, fine-grained block scaling across various low-precision formats (FP8, FP6, FP4, and Int8). This standard is characterized by three key components:

    • Scale (X) data type/encoding: Defines how the scaling factor itself is represented.
    • Element (P_i) data type/encoding: Specifies the data type of the actual numerical elements within each block.
    • Scaling block size (k): Determines the number of scalar elements that share a common scaling factor.
    Figure 4. MXFP8 block scaling: a shared scale (X) is applied to k scalar elements (P_1 to P_k) within each block, in contrast to a single global scale

    How does MXFP8 work?

    MXFP8 implements blockwise scaling with specific characteristics optimized for the NVIDIA Blackwell architecture:

    1. Blockwise division: Tensors are systematically divided into contiguous blocks, each containing 32 consecutive values, matched to the design of NVIDIA Blackwell Tensor Cores.
    2. Exponent-only scaling factors: Each of the 32-value blocks is assigned a dedicated scaling factor stored in an E8M0 format (8 bits of exponent, 0 bits of mantissa) and represents a power-of-2 multiplier. This optimizes hardware implementation and maintains numerical properties beneficial for deep learning.
    3. Hardware requantization for transposes: Given that scaling occurs in a particular direction (e.g., row-wise or column-wise), an MXFP8 tensor and its transpose aren’t numerically equivalent. When operations require the tensor and its transpose (common in forward and backward passes), Blackwell units perform automatic requantization from the high-precision input. This preserves accuracy by avoiding compounding quantization errors from transposing a quantized FP8 tensor.

    This innovative blockwise scaling enables MXFP8 to overcome the limitations of per-tensor scaling, accommodating variations in magnitude within a single tensor. By preserving high and low-magnitude regions, MXFP8 maximizes the utilization of FP8’s dynamic range and minimizes quantization error. This translates to faster training and improved accuracy, particularly for large models.
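The sketch below mimics MXFP8-style quantization for a 1D tensor: 32-element blocks, each with a power-of-two decode scale standing in for the E8M0 scale factor. The exponent rounding is simplified relative to the MX specification and the hardware behavior, and the FP8 element cast is again simulated by clamping.

```python
# Illustrative MXFP8-style quantization for a 1D tensor: 32-element blocks,
# each with a power-of-two decode scale standing in for the E8M0 scale
# factor. Exponent rounding is simplified relative to the MX specification,
# and the FP8 element cast is simulated by clamping to the E4M3 limit.
import numpy as np

BLOCK = 32
E4M3_MAX = 448.0

def mxfp8_quantize(x: np.ndarray):
    assert x.size % BLOCK == 0
    blocks = x.reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    amax = np.where(amax > 0, amax, 1.0)
    # E8M0 has 8 exponent bits and no mantissa, so the shared scale must be a
    # pure power of two; ceil keeps every scaled element within the E4M3 range.
    scales = 2.0 ** np.ceil(np.log2(amax / E4M3_MAX))
    elems = np.clip(blocks / scales, -E4M3_MAX, E4M3_MAX)
    return elems, scales

def mxfp8_dequantize(elems: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (elems * scales).reshape(-1)
```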

    Figure 5. Validation perplexity during training for 2B and 8B parameter Nemotron models using MXFP8 and BF16

    Figure 5 compares the BF16 and MXFP8 validation perplexity curves for the Nemotron 2B and 8B models. For the 2B parameter model, MXFP8 tracks the BF16 curve even as the number of training tokens increases, meaning there is no significant accuracy difference between the two data types and the model converges as expected. 

    The 8B parameter model shows the same behavior: MXFP8 matches the BF16 validation perplexity curve throughout training, confirming that convergence is unaffected even at the larger model size.

    Block scaling 

    Beyond per-tensor scaling strategies and hardware-specific block scaling like MXFP8, generic FP8 block scaling is a versatile and configurable approach to fine-grained precision control. 

    This technique stands out for its adaptability across a broad range of model architectures and hardware requirements.

    Figure 6. Comparison of per-block scaling methods for matrix multiplication

    FP8 block scaling operates by dividing each tensor into smaller, user-defined blocks, where all values within a given block share a common scaling factor. Unlike approaches with a fixed block size, such as MXFP8, generic FP8 block scaling enables configurable block dimensions.

    Users can specify various shapes—for example, 1×128 or 128×128—to optimize scaling for different tensor structures. These block-specific scaling factors are typically stored in FP32 format, offering robust numerical properties.

    This configurable granularity enables a more precise adaptation to the unique statistical behaviors of different model components, making it a powerful tool for optimizing FP8 training across diverse workloads.

    How does block scaling work?

    The core principle behind FP8 block scaling is to enhance precision by adapting the scaling factor to localized data characteristics. Unlike per-tensor methods that apply a single scale across an entire tensor, block scaling divides each tensor into smaller, distinct segments. Within each of these defined blocks, all values share a common scaling factor stored separately in FP32 for accuracy. 

    Quantization divides each element by its block’s scaling factor and converts the result to FP8; dequantization reconstructs the original values by multiplying the FP8 values by the same factor. The choice of block size is crucial: smaller blocks reduce quantization error through finer granularity, while larger blocks lower storage overhead at the cost of increased noise.
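Here is a hedged sketch of generic block scaling with a configurable 2D block shape (for example, 1×128 or 128×128) and FP32 decode scales, following the quantize/dequantize convention described above. The function names are illustrative and the FP8 cast is simulated by clamping.

```python
# Illustrative generic FP8 block scaling with a configurable 2D block shape
# (for example, 1x128 or 128x128) and FP32 decode scales. Quantization
# divides by the block's scale, dequantization multiplies by it; the FP8
# cast is simulated by clamping to the E4M3 limit.
import numpy as np

E4M3_MAX = 448.0

def block_quantize(x: np.ndarray, block_shape=(128, 128)):
    rows, cols = x.shape
    br, bc = block_shape
    assert rows % br == 0 and cols % bc == 0
    blocks = x.reshape(rows // br, br, cols // bc, bc)     # grid of (br x bc) blocks
    amax = np.abs(blocks).max(axis=(1, 3), keepdims=True)
    scales = np.where(amax > 0, amax / E4M3_MAX, 1.0).astype(np.float32)
    x_fp8 = np.clip(blocks / scales, -E4M3_MAX, E4M3_MAX)  # simulated FP8 cast
    return x_fp8.reshape(rows, cols), scales.squeeze(axis=(1, 3))

def block_dequantize(x_fp8: np.ndarray, scales: np.ndarray, block_shape=(128, 128)):
    rows, cols = x_fp8.shape
    br, bc = block_shape
    blocks = x_fp8.reshape(rows // br, br, cols // bc, bc)
    return (blocks * scales[:, None, :, None]).reshape(rows, cols)
```

In this sketch, a shape such as 1×128 behaves like per-row sub-channel scaling, while 128×128 trades some granularity for fewer stored scale factors, mirroring the block-size tradeoff described above.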

    Recipes in the NVIDIA NeMo Framework

    Transformer Engine FP8 recipes are exposed through high-level configurations, such as in the NVIDIA NeMo framework. Users can select different fp8_recipe flags for scaling strategies, often combined with BF16 for mixed precision training, as in the sketch after the following list:

    • Delayed Scaling: fp8_recipe = "delayed"
    • Per-Tensor Current Scaling: fp8_recipe = "tensorwise"
    • MXFP8: fp8_recipe = "mxfp8"
    • Generic Block Scaling: fp8_recipe = "blockwise"
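As a rough illustration of how a recipe is applied at the Transformer Engine level, which NeMo builds on, the hedged sketch below constructs a delayed-scaling recipe and runs a single FP8 forward and backward pass. Exact recipe classes and constructor arguments vary across Transformer Engine versions, the layer sizes here are arbitrary, and the other recipes listed above are selected analogously.

```python
# Hedged sketch: applying an FP8 recipe at the Transformer Engine level,
# which NeMo builds on. Recipe classes and constructor arguments vary by
# Transformer Engine version; layer sizes here are arbitrary, and an
# FP8-capable GPU (Hopper or Blackwell) is required.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# E4M3 for forward tensors, E5M2 for gradients; scale chosen from amax history.
fp8_recipe = DelayedScaling(
    fp8_format=Format.HYBRID,
    amax_history_len=1024,
    amax_compute_algo="max",
)

layer = te.Linear(1024, 1024, bias=True).cuda()
x = torch.randn(32, 1024, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)   # GEMM runs in FP8 under the chosen recipe
y.sum().backward()
```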

    Conclusion

    FP8 training is a practical and efficient solution for large-scale deep learning. The key to unlocking the full potential of FP8 lies in the choice and implementation of scaling strategies. 
    Explore these techniques by visiting the FP8 recipes and get started with practical FP8 training configurations and code.
