
    Reproducing NVIDIA MLPerf v5.0 Training Scores for LLM Benchmarks

    The previous post, NVIDIA Blackwell Delivers up to 2.6x Higher Performance in MLPerf Training v5.0, explains how the NVIDIA platform delivered the fastest time to train across all seven benchmarks in this latest MLPerf round. This post provides a guide to reproduce the performance of the NVIDIA MLPerf v5.0 submissions for Llama 2 70B LoRA fine-tuning and Llama 3.1 405B pretraining. The submission repositories also include README files for reproducing the scores; see, for example, those for the Llama 2 70B LoRA fine-tuning benchmark and the Llama 3.1 405B benchmark.

    Prerequisites

    Running NVIDIA benchmarks requires your system to have the following:

    • Container preparation, dataset/checkpoint download and preprocessing
      • Docker
      • A Hugging Face access token (for dataset/checkpoint download)
      • At least 2.5 TB of disk space for Llama 3.1 405B, or 300 GB for Llama 2 70B LoRA fine-tuning
    • Hardware requirements
      • Llama 2 70B LoRA: An NVIDIA DGX B200 or NVIDIA GB200 NVL72 system, or multiple GB200 NVL72 systems connected with InfiniBand for scales larger than 72 GPUs. The smallest NVIDIA submission for this benchmark is eight GPUs.
      • Llama 3.1 405B: At least four GB200 NVL72 systems connected with InfiniBand. The smallest NVIDIA submission for this benchmark is 256 GPUs. 
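
    Before you proceed, the following quick check is one way to confirm these prerequisites from a shell (the dataset path is a placeholder):

    docker --version                        # Docker is installed and on PATH
    df -h /path/to/dataset                  # enough free disk space (2.5 TB or 300 GB)
    : "${HF_TOKEN:?HF_TOKEN is not set}"    # fails if the Hugging Face token is not exported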

    Cluster setup

    Running NVIDIA MLPerf Training benchmarks requires:

    • Environment based on Slurm, Pyxis, and Enroot
    • Networking with NVIDIA NVLink and InfiniBand
    • Fast local storage set up in RAID0 configuration to minimize data loading bottlenecks

    NVIDIA submission clusters do not support running workloads with Docker. The clusters are governed by the NVIDIA Base Command Manager (BCM). Follow the official instructions to properly set up a BCM SLURM cluster.

    After a proper setup, you should be able to log in to the head node and access SLURM commands (sinfo, squeue, srun, sbatch) to launch jobs on the compute nodes.
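
    A minimal sanity check from the head node might look like this (add --account and --partition flags if your cluster requires them):

    sinfo                  # list partitions and node states
    squeue -u $USER        # show your queued and running jobs
    srun -N 1 hostname     # run a trivial job on one compute node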

    Running benchmarks

    The steps necessary to start benchmarking any model include the following:

    1. Build a Docker container.
    2. Run the container on any machine with Docker to download and process the dataset and the checkpoint. This step can be done on any machine, in the cluster or not, and it generally doesn’t require a system with a GPU. Make sure the data is accessible by the compute nodes. Preferably the data is stored locally on the node; alternatively, it is accessible through a fast (parallel) file system.
    3. Launch the training and parse the logs.

    Llama 2 70B LoRA

    To run benchmarks for Llama 2 70B LoRA, follow the instructions in this section.

    Build the container

    • Clone the mlcommons/training_results_v5.0 GitHub repo.
    • cd NVIDIA/benchmarks/llama2_70b_lora/implementations/tyche_ngpu72_ngc25.04_nemo.
    • docker build -t mlperf-nvidia:llama2_70b_lora-pyt . If you have a registry you would like to push the image to, add the registry name to the image name. The full sequence is sketched below.
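
    Put together, the build steps might look like the following sketch (the clone URL follows the standard GitHub pattern for the repo named above; the registry lines are optional placeholders):

    git clone https://github.com/mlcommons/training_results_v5.0.git
    cd training_results_v5.0/NVIDIA/benchmarks/llama2_70b_lora/implementations/tyche_ngpu72_ngc25.04_nemo
    docker build -t mlperf-nvidia:llama2_70b_lora-pyt .
    # optional: tag and push to your own registry (placeholder name)
    # docker tag mlperf-nvidia:llama2_70b_lora-pyt <registry>/mlperf-nvidia:llama2_70b_lora-pyt
    # docker push <registry>/mlperf-nvidia:llama2_70b_lora-pyt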

    Download the dataset and model

    This benchmark uses the GovReport dataset and a Hugging Face checkpoint. Both the dataset and the checkpoint require preprocessing to be used by NVIDIA NeMo. You need a Hugging Face token to download the checkpoint.

    To download and preprocess, do the following:

    # create a directory where the data will be stored
    mkdir </path/to/dataset>
    # start the docker container
    docker run -it --rm --gpus all --network=host --ipc=host --volume </path/to/dataset>:/data mlperf-nvidia:llama2_70b_lora-pyt
    # now you should be inside the container in the /workspace/ft-llm directory. run the download scripts
    python scripts/download_dataset.py --data_dir /data/gov_report  # download dataset
    python scripts/download_model.py --model_dir /data/model  # download preprocessed model checkpoint in NeMo format used for initialization; could take up to 30 minutes

    If the model download fails, you might need to export your HF_TOKEN before calling the download_model.py script:

    export HF_TOKEN=<your/huggingface/token>

    After conversion you should see the following files in the /data directory:

    /data
    ├── gov_report
    │   ├── train.npy
    │   └── validation.npy
    └── model
        ├── context
        │   ├── io.json
        │   ├── model.yaml
        │   └── nemo_tokenizer
        └── weights
            ├── common.pt
            ├── metadata.json
            ├── module.decoder.final_layernorm._extra_state
            ├── module.decoder.final_layernorm.weight
            ├── module.decoder.layers.mlp.linear_fc1._extra_state
            ├── module.decoder.layers.mlp.linear_fc1.layer_norm_weight
            ├── module.decoder.layers.mlp.linear_fc1.weight
            ├── module.decoder.layers.mlp.linear_fc2._extra_state
            ├── module.decoder.layers.mlp.linear_fc2.weight
            ├── module.decoder.layers.self_attention.core_attention._extra_state
            ├── module.decoder.layers.self_attention.linear_proj._extra_state
            ├── module.decoder.layers.self_attention.linear_proj.weight
            ├── module.decoder.layers.self_attention.linear_qkv._extra_state
            ├── module.decoder.layers.self_attention.linear_qkv.layer_norm_weight
            ├── module.decoder.layers.self_attention.linear_qkv.weight
            ├── module.embedding.word_embeddings.weight
            └── module.output_layer.weight

    You may exit the container at this point.

    Launch the benchmarking

    NVIDIA uses SLURM to launch benchmarks on the compute nodes. Two files are used to facilitate the job launch process:

    1. A configuration file (config_*.sh) that describes the hyperparameters of the model, including the number of nodes, walltime, and so on. Organizing these in a single file per submission enables easy configuration of the workload to run at desired scale with optimal hyperparameters.
    2. A fixed run.sub file that contains srun commands to launch the training, passing all the hyperparameters from the config to the Python script.

    Take a look at a typical config file, in this case config_GB200_18x4x1xtp1pp1cp8.sh. The name describes the size and type of the system:

    • GB200: Designed to run on a GB200 machine
    • 18×4: A system setting that can be decoded as NNODES x NGPUS
      • NNODES: Number of GB200 nodes
      • NGPUS: Number of GPUs per node
    • x1xtp1pp1cp8 is a parallelization schema
      • x1 is GradientAccumulation, here equal to 1, meaning no GA
      • TP1: No TensorParallel
      • PP1: No PipelineParallel
      • CP8: 8-way ContextParallel

    This means a 72-GPU configuration will be used, running on a single GB200 NVL72 rack. The benchmark will run GA1TP1PP1CP8: Global batch size (GBS) = 9.
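
    As a sketch of the batch-size arithmetic (the variable names below are for illustration only and are not config variables):

    NGPUS=$((18 * 4))                 # DGXNNODES x DGXNGPU = 72
    TP=1; PP=1; CP=8                  # tensor, pipeline, and context parallelism
    GA=1; MBS=1                       # gradient accumulation and micro batch size
    DP=$((NGPUS / (TP * PP * CP)))    # 72 / 8 = 9 data-parallel replicas
    GBS=$((DP * GA * MBS))            # global batch size = 9
    echo "DP=${DP} GBS=${GBS}"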

    Next, take a closer look at the content of the config file. The first part sources config_common.sh, which contains hyperparameters and optimization flags used by all configs; in some cases, a config overrides a flag from config_common.sh. The config then sets the max steps, learning rate, gradient accumulation (MINIBS), and the aforementioned parallelization schema.

    #!/bin/bash
    source $(dirname ${BASH_SOURCE[0]})/config_common.sh
     
    # hyperparameters
    export MAX_STEPS=800
    export LR=0.0005
    export MINIBS=1
    export TP=1
    export SP=0
    export CP=8

    Next is a section to add system-specific optimizations and override the common flags if needed.

    export LAYER_CUDA_GRAPH=0
    export MCORE_CUDA_GRAPH=1

    Next are the system-level settings to pass to SLURM.

    # system parameters
    export VBOOST_VALUE=0
    export DGXNNODES=18
    export DGXNGPU=4
    export WALLTIME_RUNANDTIME=10
    export WALLTIME=$((5 + ${NEXP:-1} * ($WALLTIME_RUNANDTIME + 5)))

    The configs are tuned for a particular system size, both in terms of optimization flags (which impact performance) and hyperparameters (which impact convergence). It is possible to modify a given config to run on a system of a different size, but that requires careful consideration and is not guaranteed to be as performant as the original config.

    To start the actual training, tell the script where the dataset and model are located, where the output logs should be stored, and which container to use; then source the config file and run the sbatch command:

    export DATADIR="</path/to/dataset>/gov_report"  # set your </path/to/dataset>
    export MODEL="</path/to/dataset>/model"  # set your </path/to/dataset>
    export LOGDIR="</path/to/output_logdir>"  # set the place where the output logs will be saved
    export CONT=mlperf-nvidia:llama2_70b_lora-pyt
    source config_<system>.sh  # select config and source it
    sbatch -N $DGXNNODES -t $WALLTIME run.sub  # you may be required to set --account and --partition here

    Parse the logs

    The logfile will contain a lot of output from the initialization and other info lines. The MLPerf-relevant lines start with the MLPerf logger prefix: :::MLLOG. There are a few interesting markers, as shown below.

    Initialization starts:

    :::MLLOG {"namespace": "", "time_ms": 1745066769306, "event_type": "INTERVAL_START", "key": "init_start", "value": null, "metadata": {"file": "/workspace/ft-llm/train.py", "lineno": 327}}

    Here, you can see that the Python script has started. Below it, you can see the hyperparameters that you selected using the config file, along with the default (immutable) ones.

    After the initialization is completed, and the model is warmed up, the init_stop and run_start markers are printed:

    :::MLLOG {"namespace": "", "time_ms": 1745066917960, "event_type": "INTERVAL_END", "key": "init_stop", "value": null, "metadata": {"file": "/usr/local/lib/python3.12/dist-packages/mlperf_common/callbacks/logging.py", "lineno": 83}}
    :::MLLOG {"namespace": "", "time_ms": 1745066917961, "event_type": "INTERVAL_START", "key": "run_start", "value": null, "metadata": {"file": "/usr/local/lib/python3.12/dist-packages/mlperf_common/callbacks/logging.py", "lineno": 83}}

    The run_start line marks the start of the timing clock. The following lines show the progress of the training, including evaluation. You can see that the evaluation loss is decreasing, marked by the eval_accuracy marker.

    When the evaluation accuracy (evaluation loss in reality) drops below the threshold of 0.925, the training stops and the run_stop marker is printed:

    :::MLLOG {"namespace": "", "time_ms": 1745067009412, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 0.92474365234375, "metadata": {"file": "/usr/local/lib/python3.12/dist-packages/mlperf_common/callbacks/logging.py", "lineno": 303, "samples_count": 3024}}
    :::MLLOG {"namespace": "", "time_ms": 1745067009420, "event_type": "INTERVAL_END", "key": "run_stop", "value": null, "metadata": {"file": "/usr/local/lib/python3.12/dist-packages/mlperf_common/callbacks/logging.py", "lineno": 106, "samples_count": 3024, "status": "success"}}

    If the benchmark fails to converge, the run_stop status will display as ‘aborted’. The MLPerf score is the difference between the run_stop and run_start timestamps. In this case:

    Score [milliseconds] = (1745067009420 – 1745066917961) = 91459
    Score [minutes] = 91459/60000 = 1.524
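
    As a minimal sketch, the two timestamps can be extracted from a log file automatically (the log file name is an assumption; point it at the file written under LOGDIR):

    LOG=/path/to/output_logdir/result.log   # assumed name; adjust to your run
    start=$(grep '"key": "run_start"' "$LOG" | head -n 1 | sed 's/.*"time_ms": \([0-9]*\).*/\1/')
    stop=$(grep '"key": "run_stop"' "$LOG" | head -n 1 | sed 's/.*"time_ms": \([0-9]*\).*/\1/')
    echo "Score [milliseconds]: $((stop - start))"
    awk -v s="$start" -v e="$stop" 'BEGIN { printf "Score [minutes]: %.3f\n", (e - s) / 60000 }'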

    Keep in mind that because convergence is nondeterministic, the final score has to be derived from multiple runs, as the number of samples needed to converge may vary. Here, the benchmark converged at 3,024 samples, while on average it should converge at around 3,100-3,200 samples.

    Llama 3.1 405B

    To run benchmarks for Llama 3.1 405B, follow the instructions in this section.

    Build the container

    • Clone the mlcommons/training_results_v5.0 GitHub repo.
    • cd NVIDIA/benchmarks/llama31_405b/implementations/tyche_ngpu512_ngc25.04_nemo.
    • docker build -t mlperf-nvidia:large_language_model-pyt . If you have a registry you would like to push the image to, add the registry name to the image name.

    Download the dataset and model

    For instructions on how to download the dataset and the tokenizer, see the Llama 3.1 405B reference README.

    Environment variable PREPROCESSED_PATH points to the preprocessed dataset. Downloaded files should end with .idx and .bin.

    c4-train.en_<number>_text_document, where <number> ranges from 0 to 7
    c4-validation-91205-samples

    Environment variable TOKENIZER_PATH points to the tokenizer used in this benchmark. Downloaded files include:

    special_tokens_map.json
    tokenizer.json
    tokenizer.model
    tokenizer.model.v1
    tokenizer_config.json

    You can clean up unnecessary files by running the cleanup script:

    bash scripts/cleanup.sh

    The final PREPROCESSED_PATH directory should contain:

    c4-train.en_6_text_document.bin
    c4-train.en_6_text_document.idx
    c4-train.en_7_text_document.bin
    c4-train.en_7_text_document.idx
    c4-validation-91205-samples.en_text_document.bin
    c4-validation-91205-samples.en_text_document.idx

    Checkpoint

    In the benchmarking region, training resumes from the official Meta checkpoint on Hugging Face. Refer to the instructions in the reference README to download the BF16 model checkpoint. Before you proceed, make sure that your current working directory can hold more than 1.5 TB of data.

    Run the download command in a directory whose location is stored in the LOAD_CHECKPOINTS_PATH environment variable. After the checkpoint is downloaded, you should find a 405b folder, holding context and weights subfolders, under that directory:

    <LOAD_CHECKPOINTS_PATH>
    └── 405b
        ├── context
        │   ├── nemo_tokenizer
        │   │   ├── special_tokens_map.json
        │   │   ├── tokenizer_config.json
        │   │   └── tokenizer.json
        │   ├── io.json
        │   └── model.yaml
        └── weights
            ├── __0_0.distcp
            ├── __0_1.distcp
            ├── .metadata
            ├── common.pt
            └── metadata.json

    Launch the benchmarking

    NVIDIA uses SLURM to launch benchmarks on the compute nodes. To facilitate the job launch process, similarly to Llama 2 70B LoRA, two files are used:

    1. A configuration file (config_*.sh) that describes the hyperparameters of the model, including the number of nodes, walltime, and so on. Selecting a proper file enables you to easily configure the workload to run at desired scale with optimal hyperparameters.
    2. A fixed run.sub file that contains srun commands to launch the training, passing all the hyperparameters from the config to the Python script.

    Take a look at a typical config file, in this case config_GB200_128x4x112xtp4pp8cp2_cg_dplast.sh. The name describes the size and type of the system:

    • GB200: Designed to run on a GB200 machine
    • 128×4: A system setting that can be decoded as NNODES x NGPUS
      • NNODES: Number of GB200 nodes
      • NGPUS: Number of GPUs per node
    • x112xtp4pp8cp2 is a parallelization schema:
      • x112 is GradientAccumulation, here equal to 112
      • TP4: 4-way TensorParallel
      • PP8: 8-way PipelineParallel
      • CP2: 2-way ContextParallel

    This means that a 512-GPU configuration will be used, running on eight GB200 NVL72 racks with 64 GPUs used from each rack. The benchmark will run GA112TP4PP8CP2: with TP4 x PP8 x CP2 = 64-way model parallelism across 512 GPUs, there are 8 data-parallel replicas, so the global batch size (GBS) = 8 x 112 = 896.

    Taking a closer look at the content of the config file, the first part sources the configs containing hyperparameters and optimization flags used by all configs, by configs running on Blackwell GPUs, and by configs using CUDA Graphs. In some cases, a flag from config_common.sh is overridden. The config then sets the gradient accumulation (MINIBS), parallelization schema, micro batch size, the (frozen) model size, and max steps.

    source $(dirname ${BASH_SOURCE[0]})/config_common.sh
    source $(dirname ${BASH_SOURCE[0]})/config_common_blackwell.sh
    source $(dirname ${BASH_SOURCE[0]})/config_common_cg.sh
     
    export MINIBS=112
    export TENSOR_MODEL_PARALLEL=4
    export PIPELINE_MODEL_PARALLEL=8
    export INTERLEAVED_PIPELINE=8
    export CONTEXT_PARALLEL=2
    export MICRO_BATCH_SIZE=1
    export MODEL_SIZE="405b"
    export MAX_STEPS=450

    Next, set performance optimization flags:

    export FP8_PARAM_GATHER=True
    export TP_COMM_OVERLAP=True
    export ASYM_PP_EMBED=True
    export ASYM_PP_LOSS=True
    export TP_PP_DP_MAPPING=True
     
    # Binding
    export BINDCMD="bindpcie --cpu=node"

    Next are the system-level settings to pass to SLURM:

    export DGXNNODES=128
    export DGXNGPU=4
    export DGXSYSTEM=$(basename $(readlink -f ${BASH_SOURCE[0]}) | sed 's/^config_//' | sed 's/\.sh$//' )
     
    export WALLTIME_RUNANDTIME=180
    export WALLTIME=$((5 + ${NEXP:-1} * ($WALLTIME_RUNANDTIME + 5)))

    To start the actual training, tell the script where the dataset and checkpoint are located, where the output logs should be stored, and which container to use; then source the config file and run the sbatch command:

    export PREPROC_DATA="/path/to/your/preprocessed_c4"
    export TOKENIZER="/path/to/your/tokenizer.model"
    export LOAD_CHECKPOINTS_PATH="/path/to/your/downloaded/checkpoint"
    export LOAD_CHECKPOINT="/load_checkpoints/405b"
    export LOGDIR="</path/to/output/dir>"  # set the place where the output logs will be saved
    export CONT=mlperf-nvidia:large_language_model-pyt
    source config_GB200_128x4x112xtp4pp8cp2_cg_dplast.sh  # select config and source it
    sbatch -N ${DGXNNODES} --time=${WALLTIME} run.sub  # you may be required to set --account and --partition here

    Parse the logs

    The logfiles should largely resemble the Llama 2 70B LoRA logs. The target accuracy (evaluation loss) is 5.6. The training will stop once the target has been reached, and print the run_stop marker.
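
    To watch convergence while the job runs, one option is to follow the eval_accuracy lines in the log (the log file path is again an assumption):

    grep '"key": "eval_accuracy"' /path/to/output_logdir/result.log | tail -n 5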
