    NVIDIA Megatron-Core

    Train generative AI models from scratch at scale.

    NVIDIA Megatron-Core is a PyTorch-based open-source library to train gigantic models with unparalleled speed at large scale across thousands of GPUs. It features GPU-optimized training techniques and cutting-edge system-level innovations, all accessible through composable APIs. Megatron-Core integrates seamlessly with NVIDIA NeMo to provide an end-to-end, cloud-native solution to build, customize, and deploy large language models (LLMs).

    Download Now | Get Started


    Explore Features and Benefits of NVIDIA Megatron-Core

    Parallelism Techniques

    The Megatron-Core library offers advanced model parallelism techniques, including tensor, sequence, pipeline, context, and MoE expert parallelism, for large-scale training.

    Users of Megatron-Core have the flexibility to combine different model parallel techniques to best suit their training workloads. Additionally, Megatron-Core offers memory-saving functionalities, including activation recomputation, distributed optimizers, and distributed checkpointing.
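
    For orientation, here is a minimal sketch (not an excerpt from the documentation) of how a Megatron-Core job typically sets up its parallel groups; the parallel sizes below are placeholders, and the context- and expert-parallel arguments assume a recent Megatron-Core release:

        import os
        import torch
        from megatron.core import parallel_state

        # Initialize torch.distributed first (one process per GPU, e.g. launched via torchrun).
        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
        torch.distributed.init_process_group(backend="nccl", world_size=world_size, rank=rank)

        # Carve the world into tensor-, pipeline-, context-, and expert-parallel groups;
        # the remaining dimension becomes data parallelism, so world_size must be
        # divisible by the product of the tensor, pipeline, and context sizes.
        parallel_state.initialize_model_parallel(
            tensor_model_parallel_size=2,
            pipeline_model_parallel_size=2,
            context_parallel_size=1,
            expert_model_parallel_size=1,
        )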

    Learn more in the API documentation.

    Customizable Building Blocks

    Megatron-Core offers customizable building blocks with modular and composable APIs. For transformer models, it offers attention mechanisms, normalization layers, embedding techniques, and more.

    With the Megatron-Core (MCore) spec system, researchers can easily customize submodules in the PyTorch model definition at their desired abstraction level.
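
    As a small sketch of what this looks like in code (adapted in spirit from the Megatron-Core quickstart; it assumes torch.distributed and the parallel state have already been initialized as in the parallelism sketch above, and uses deliberately tiny hyperparameters):

        from megatron.core.transformer.transformer_config import TransformerConfig
        from megatron.core.models.gpt.gpt_model import GPTModel
        from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec

        # The config holds the architecture hyperparameters (tiny values for illustration).
        transformer_config = TransformerConfig(
            num_layers=2,
            hidden_size=12,
            num_attention_heads=4,
            use_cpu_initialization=True,
        )

        # The layer spec declares which submodule implementations (attention, normalization,
        # MLP, ...) each transformer layer is assembled from; swapping an entry in the spec
        # swaps the implementation without touching the rest of the model definition.
        gpt_model = GPTModel(
            config=transformer_config,
            transformer_layer_spec=get_gpt_layer_local_spec(),
            vocab_size=100,
            max_sequence_length=64,
        )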

    Learn more about the MCore Spec system in documentation.

    Scalability and Training Resiliency

    Efficiently train large models at scale with training resiliency features such as fast distributed checkpointing.
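
    A minimal sketch of the distributed checkpointing API (following the Megatron-Core quickstart; each rank saves and loads only its own shards, and the checkpoint can later be resharded onto a different parallel layout):

        from megatron.core import dist_checkpointing

        def save_distributed_checkpoint(checkpoint_path, gpt_model):
            # Every rank contributes its shards of the model state in parallel.
            sharded_state_dict = gpt_model.sharded_state_dict(prefix="")
            dist_checkpointing.save(sharded_state_dict=sharded_state_dict,
                                    checkpoint_dir=checkpoint_path)

        def load_distributed_checkpoint(checkpoint_path, gpt_model):
            sharded_state_dict = gpt_model.sharded_state_dict(prefix="")
            checkpoint = dist_checkpointing.load(sharded_state_dict=sharded_state_dict,
                                                 checkpoint_dir=checkpoint_path)
            gpt_model.load_state_dict(checkpoint)
            return gpt_model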

    Learn more about how Megatron-Core enabled training of the Nemotron-4 340B model at a scale of more than 6,000 H100 GPUs while achieving high per-GPU throughput.

    See details in this scalability benchmark.

    Cutting-Edge Research

    Leverage NVIDIA's cutting-edge research to stay at the forefront of distributed training by simply upgrading to the latest Megatron-Core.

    Pioneering large-model training since 2019, Megatron-Core continues to lead innovation in large-scale training.

    Learn about some of the recent advancements in this blog.

    Train With Mixture-of-Experts

    Pretrain models with Mixture-of-Experts (MoE), a popular technique to achieve better accuracy without a corresponding increase in compute.

    Megatron-Core offers performant functionality for both token-dropless and token-dropping use cases, with training speed optimizations.
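
    As a rough sketch of how MoE is switched on (field names reflect recent Megatron-Core releases and may vary by version; eight experts with top-2 routing are arbitrary example values):

        from megatron.core.transformer.transformer_config import TransformerConfig
        from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec

        # MoE-specific knobs sit alongside the usual architecture hyperparameters.
        moe_config = TransformerConfig(
            num_layers=2,
            hidden_size=12,
            num_attention_heads=4,
            use_cpu_initialization=True,
            add_bias_linear=False,    # bias-free linear layers, common in recent MoE setups
            num_moe_experts=8,        # experts per MoE layer
            moe_router_topk=2,        # each token is routed to its top-2 experts
        )

        # The layer spec must also be built with experts enabled so the MLP
        # submodule becomes an expert (MoE) layer.
        moe_layer_spec = get_gpt_layer_local_spec(num_experts=8, moe_grouped_gemm=False)

    Expert parallelism is then selected with the expert_model_parallel_size argument of parallel_state.initialize_model_parallel, as in the earlier sketch.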

    Learn more about MoE features in our repository.

    Beyond Transformers: Hybrid Models

    Megatron-Core expanded its support from Transformer-based models to hybrid models that combine state space models, state space dualities, and recurrent neural networks.

    Hybrid models have emerged as a compelling model architecture for sequence modeling tasks, as they overcome several limitations of attention.

    Learn more about training Mamba-based hybrid models in our paper and code example.

    Multimodal Training

    Train with multimodality using Megatron-Core’s parallelism techniques and its multi-modal data loader library, featuring determinism and reproducibility when blending multimodal datasets.

    Get started with the LLaVA (large language and vision assistant) training pipeline.


    Get Started

    Using Megatron-Core with NVIDIA NeMo

    NVIDIA NeMo is an end-to-end platform for developing custom generative AI, including LLMs and multimodal, vision, and speech AI. NeMo builds on Megatron-Core and is suited for developers building enterprise-ready generative AI applications.

    Learn More

    Using Megatron-Core with NVIDIA Megatron-LM

    Megatron-LM is an open-source lightweight training framework with a native PyTorch training loop for exploring Megatron-Core. It’s easily customizable and is suitable for researchers who prefer minimum abstraction layers on top of Megatron-Core’s training techniques.

    Learn More

    World-Leading Training Speed and Scalability

    Megatron-Core is capable of efficiently training large language models with its parallelism techniques. In the weak scaling experiments below, with GPT models ranging from 2 billion to 462 billion parameters, Megatron-Core demonstrates superlinear scaling up to 6144 H100 GPUs.

    Model size | Tensor MP size | Pipeline MP size | Data-parallel size | Number of GPUs | Batch size | Per-GPU teraFLOP/s | MFU
    2.1B       | 1              | 1                | 16 - 64            | 16 - 64        | 256        | 441 - 412          | 45% - 42%
    4.2B       | 2              | 1                | 16 - 64            | 32 - 128       | 256        | 431 - 415          | 44% - 42%
    8.3B       | 4              | 1                | 16 - 64            | 64 - 256       | 256        | 457 - 426          | 46% - 43%
    19.7B      | 8              | 1                | 16 - 64            | 128 - 512      | 512        | 439 - 429          | 44% - 43%
    41B        | 8              | 1                | 32 - 128           | 256 - 1024     | 768        | 469 - 439          | 47% - 44%
    78B        | 8              | 2                | 32 - 96            | 512 - 1536     | 960        | 446 - 418          | 45% - 42%
    148B       | 8              | 4                | 24 - 72            | 768 - 2304     | 1152       | 456 - 432          | 46% - 44%
    314B       | 8              | 8                | 16 - 48            | 1024 - 3072    | 1152       | 490 - 464          | 50% - 47%
    509B       | 8              | 20               | 8 - 24             | 1280 - 3840    | 1440       | 473 - 426          | 48% - 43%

    Ranges span the configurations benchmarked for each model size (varying the data-parallel size and, with it, the GPU count), along with the corresponding range of per-GPU throughput and model FLOP utilization (MFU).
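
    For reference, the GPU count in each row is the product of the three parallel sizes, and MFU is the achieved per-GPU throughput divided by the GPU's peak; the figures above are consistent with an assumed H100 BF16 dense peak of roughly 989 teraFLOP/s (an assumption for illustration, not a number stated on this page):

        # Illustration only: reproduce two cells of the 314B row above.
        def num_gpus(tensor_mp, pipeline_mp, data_parallel):
            return tensor_mp * pipeline_mp * data_parallel

        def mfu(per_gpu_tflops, peak_tflops=989.0):  # assumed H100 BF16 dense peak
            return per_gpu_tflops / peak_tflops

        print(num_gpus(8, 8, 48))   # 3072 GPUs
        print(f"{mfu(490):.0%}")    # ~50% MFU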

    Aggregate Throughput (Weak Scaling)

    A graph showing NVIDIA Megatron-Core aggregate throughput for weak scaling

    Aggregate Throughput (Strong Scaling)

    In the strong-scaling setting, with a 177-billion-parameter GPT-3 model and the same batch size of 1152 sequences throughout, Megatron-Core demonstrates near-linear scaling from 96 to 4608 H100 GPUs.

    A graph showing NVIDIA Megatron-Core aggregate throughput for strong scaling
    See Detailed Benchmark

    Resources

    Use Megatron to Train Large Models at Unparalleled Speed

    Get Started
