
    Turbocharge LLM Training Across Long-Haul Data Center Networks with NVIDIA NeMo Framework

    A multi-data center illustration.

    Multi-data center training is becoming essential for AI factories as pretraining scaling fuels the creation of even larger models, leading the demand for computing performance to outpace the capabilities of a single facility. By distributing workloads across multiple data centers, organizations can overcome limitations in power, cooling, and space, enabling the training of even larger, more accurate models with better efficiency. 

    The latest release of NVIDIA NeMo Framework 25.02 and NVIDIA Megatron-Core 0.11.0 brings new capabilities for multi-data center large language model (LLM) training. This update enables users to scale training beyond the physical and operational limits of a single data center, unlocking unprecedented efficiency and performance by harnessing the combined power of multiple sites. 

    In this post, we’ll cover how NeMo Framework and Megatron-Core are revolutionizing multi-data center training with these key advances:

    • High efficiency across sites: Achieving 96% scaling efficiency by effectively distributing training across thousands of NVIDIA GPUs in geographically separated data centers.
    • Advanced communication strategies: Overcoming inter-data center latency using hierarchical orchestration and gradient synchronization.
    • Real-world success: Validating these innovations through the efficient training of an LLM with 340B parameters, paving the way for next-generation AI supercomputing.

    Why multi-data center training is hard

    Training trillion-parameter models isn’t just about adding more GPUs; it’s about overcoming key infrastructure challenges that influence cost and performance. When managing compute across multiple data centers, developers must contend with high inter-region latency (often 20 milliseconds or more) that can bottleneck gradient updates and model synchronization during large-scale LLM training. Addressing these issues enables distributed training architectures that boost performance, maximize hardware and energy efficiency, reduce infrastructure strain, and let AI projects scale across geographic regions without centralized resource bottlenecks.

    Key challenges include:

    • High latency and limited bandwidth: Communication between data centers can be slow and constrained, reducing training efficiency.
    • Synchronization: Keeping distributed data centers aligned demands sophisticated protocols and techniques.
    • Traffic management: Minimizing data flow over long-haul networks is essential to maintaining low latency and high throughput.

    Enabling high-efficiency multi-data center training

    To overcome the challenges of multi-data center LLM training, NeMo Framework 25.02 and Megatron-Core 0.11.0 introduce four key innovations:

    • Adaptive resource orchestration
    • Hierarchical all-reduce (HAR)
    • Distributed optimizer architecture
    • Chunked inter-data center communications

    These capabilities optimize communication, orchestration, and compute efficiency across geographically separated sites, ensuring scalable, high-performance training of the world’s largest AI models.

    Adaptive resource orchestration

    Adaptive resource orchestration is a distributed training strategy that leverages the hierarchy of latency and bandwidth between GPUs within and across clusters. It selects and prioritizes parallelism methods that are resilient to communication delays and bandwidth constraints, making it ideal for cross-data center deployments. In these setups, model-parallel techniques such as tensor, context, and expert parallelism demand frequent, high-bandwidth synchronization that doesn’t suit the high-latency environment between data centers. Instead, data parallelism and pipeline parallelism are favored due to:

    • Latency tolerance: Data parallelism’s batched gradient aggregation accommodates the larger delays inherent in inter-data center communications.
    • Bandwidth efficiency: Hierarchical reduction patterns in data parallelism consolidate cross-data center traffic, significantly lowering bandwidth requirements.
    • Hardware agnosticism: Both methods abstract hardware differences across sites through standardized sharding.

    By aligning the choice of parallelism technique with the network’s constraints, adaptive resource orchestration reduces the inter-data center bandwidth requirement per GPU to roughly 1/N of the intra-data center demand, where N is the number of GPUs participating in each local reduction, yielding substantial efficiency gains over traditional flat approaches.
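    To make the placement concrete, here is a small Python sketch of how parallelism dimensions might be mapped onto the network hierarchy. It is illustrative only: the helper name plan_parallelism, the GPU counts, and the tensor/pipeline sizes are assumptions, not the settings used in the experiment described later in this post.

```python
def plan_parallelism(gpus_per_node=8, nodes_per_dc=192, num_dcs=2,
                     tensor_parallel=8, pipeline_parallel=8):
    """Keep bandwidth-hungry parallelism on fast local links; let only the
    latency-tolerant dimensions span slower tiers of the network."""
    total_gpus = gpus_per_node * nodes_per_dc * num_dcs
    model_parallel = tensor_parallel * pipeline_parallel   # frequent, high-bandwidth sync
    assert total_gpus % model_parallel == 0
    data_parallel = total_gpus // model_parallel           # batched gradient sync, WAN-tolerant
    return {
        "total_gpus": total_gpus,
        "tensor_parallel (needs fast intra-node links)": tensor_parallel,
        "pipeline_parallel (latency-tolerant)": pipeline_parallel,
        "data_parallel (latency-tolerant, synced hierarchically)": data_parallel,
        "data_parallel_replicas_per_data_center": data_parallel // num_dcs,
    }

print(plan_parallelism())
```

    The point of the sketch is the ordering: the dimensions that synchronize most frequently stay on the fastest links, and only data parallelism’s batched gradient reduction ever needs to cross the inter-data center network.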

    Hierarchical all-reduce

    Hierarchical all-reduce (HAR) synchronizes gradients in three steps:

    1. ReduceScatter within each data center (the local group).
    2. AllReduce across data centers.
    3. AllGather within each data center.

    This method minimizes traffic over long-haul networks: because the reduction happens within each data center first, only the reduced gradient shards cross the inter-data center link, preserving high throughput and low latency. Figure 1 shows how HAR works, and a code sketch follows the figure.

    Figure 1. How Hierarchical AllReduce (HAR) works: a standard AllReduce compared with HAR, which performs a ReduceScatter within each local data center, an AllReduce across data centers, and finally an AllGather within each local data center
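    For readers who want to see the three steps in code, below is a minimal sketch built on PyTorch distributed collectives. It is not the Megatron-Core implementation: the helper name har_all_reduce, the flat (1-D) gradient bucket, and the two process groups (intra_dc_group holding the ranks within one data center, inter_dc_group holding the ranks that own the same shard index in each data center) are illustrative assumptions.

```python
import torch
import torch.distributed as dist

def har_all_reduce(flat_grad: torch.Tensor, intra_dc_group, inter_dc_group) -> torch.Tensor:
    """Hierarchically all-reduce a flat (1-D) gradient bucket in three steps."""
    local_size = dist.get_world_size(group=intra_dc_group)
    assert flat_grad.numel() % local_size == 0, "pad the bucket so it splits evenly"
    shard = torch.empty(flat_grad.numel() // local_size,
                        dtype=flat_grad.dtype, device=flat_grad.device)

    # 1. ReduceScatter within the local data center: each rank keeps one reduced shard.
    dist.reduce_scatter_tensor(shard, flat_grad, group=intra_dc_group)

    # 2. AllReduce of that shard across data centers. This is the only long-haul
    #    traffic, roughly 1/local_size of the full gradient per GPU.
    dist.all_reduce(shard, group=inter_dc_group)

    # 3. AllGather within the local data center to rebuild the fully reduced gradient.
    dist.all_gather_into_tensor(flat_grad, shard, group=intra_dc_group)
    return flat_grad
```

    The process groups would typically be created with dist.new_group; the key property is that step 2 is the only collective whose participants sit in different data centers.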

    Distributed optimizer architecture

    The partial-data-parallel distributed optimizer enhances efficiency through localized weight updates and gradient reductions within each data center, followed by a single synchronized gradient reduction across sites. This eliminates redundant optimizer state duplication while minimizing cross-data center communication. By sharding optimizer states within data centers (rather than globally) and replicating optimizer instances across sites, the architecture preserves memory efficiency at scale while reducing inter-data center traffic.
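    As a rough illustration of that layout (not Megatron-Core’s API; the helper name, the contiguous per-data-center rank ordering, and the two-site setup are hypothetical), the mapping from optimizer shards to the ranks that own a replica of each shard might look like this:

```python
def optimizer_shard_owners(global_dp_ranks, num_dcs=2):
    """Shard optimizer states across the data-parallel ranks *within* a data center
    and replicate each shard across data centers."""
    dp_per_dc = len(global_dp_ranks) // num_dcs
    owners = {}
    for shard_idx in range(dp_per_dc):                   # one shard per intra-DC DP rank
        owners[shard_idx] = [global_dp_ranks[dc * dp_per_dc + shard_idx]
                             for dc in range(num_dcs)]   # replicated once per data center
    return owners

# Example: 8 data-parallel ranks split across 2 data centers -> 4 shards,
# each owned by one rank in the first data center and its counterpart in the second.
print(optimizer_shard_owners(list(range(8))))
# {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
```

    Because every data center owns a full (sharded) copy of the optimizer, each site can apply weight updates locally as soon as the cross-site gradient reduction completes, so only the gradient reduction touches the long-haul link.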

    Chunked inter-data center communications

    By splitting communications into chunks and overlapping those chunks with computation, inter-data center communication can be hidden behind intra-data center operations. This technique ensures that the training process remains efficient even at large scales, enabling high tolerance to latency between sites. 
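    A minimal sketch of the idea follows, assuming PyTorch asynchronous collectives on a flat (1-D) gradient shard; the function name, the chunk count, and the compute_next_microbatch stand-in are illustrative, not the framework’s actual scheduling code.

```python
import torch
import torch.distributed as dist

def overlapped_cross_dc_reduce(grad_shard: torch.Tensor, inter_dc_group,
                               num_chunks=8, compute_next_microbatch=None):
    """Split the cross-site AllReduce into chunks and hide each chunk's long-haul
    transfer behind ongoing local computation."""
    handles = []
    for chunk in grad_shard.chunk(num_chunks):        # contiguous views of the flat shard
        # Launch the inter-data center AllReduce for this chunk without blocking.
        handles.append(dist.all_reduce(chunk, group=inter_dc_group, async_op=True))
        if compute_next_microbatch is not None:
            compute_next_microbatch()                 # intra-DC work proceeds while the chunk is in flight
    for handle in handles:
        handle.wait()                                 # block only when the reduced gradients are needed
    return grad_shard
```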

    Multi-data center training of NVIDIA Nemotron-4 340B 

    Recently, we had an opportunity to run a large-scale training of Nemotron-4 340B. To set the baseline, the LLM was trained using a single data center with 3,072 NVIDIA GPUs.  

    Next, the model was trained across two data centers, located approximately 1,000 kilometers apart, to demonstrate the effectiveness of these new features. As shown in Table 1, the setup achieved over 96% of baseline throughput at a 3,072-GPU scale (1,536 GPUs in each data center), with independent inter- and intra-data center communications overlapping to maximize efficiency. By leveraging the capabilities of NeMo Framework and Megatron-Core, the training process achieved remarkable efficiency and scalability, setting a new standard for LLM development.

    Metric | Single Data Center (ORD) | Multi-Data Center (ORD + IAD)
    Total GPUs | 3,072 GPUs | 3,072 GPUs (1,536 in ORD, 1,536 in IAD)
    GPU Nodes | 384 nodes (8 GPUs per node) | 384 nodes (8 GPUs per node)
    Data Center Locations | Oracle Cloud Infrastructure (OCI) – Chicago, IL (ORD) | OCI – Chicago, IL (ORD) and Ashburn, VA (IAD)
    Distance Between Data Centers | N/A | Approximately 1,000 km
    Measured Round-Trip Latency | N/A | 21 milliseconds
    Scaling Efficiency | Baseline (100%) | Over 96% of single-site baseline
    Model FLOPS Utilization (MFU) | 51% | 49%
    Training Model | Nemotron-4 340B | Nemotron-4 340B
    Table 1. Baseline single-data center training compared with multi-data center training of Nemotron-4 340B

    Unleashing supercomputing across multiple facilities

    Multi-data center training is emerging as a transformative approach in AI factories, laying the groundwork for distributed systems that span several buildings and even regions. By integrating advanced networking and synchronization technologies, this approach coordinates vast arrays of GPUs across distinct facilities, ensuring that complex training tasks can run concurrently and seamlessly. 

    NVIDIA GPU data center platforms, combined with low-latency networking solutions and the NVIDIA AI software stack, enable unprecedented parallelism. This full-stack platform paves the way for supercomputers that could eventually harness more than 500,000 GPUs across multiple data centers. This architecture not only scales computational power but also enhances reliability and flexibility by dynamically balancing workloads across multiple sites, reducing bottlenecks and optimizing energy efficiency.

    Get started today

    Support for training LLMs across multiple data centers is built into Megatron-Core, which is deeply integrated into the NVIDIA NeMo Framework, an end-to-end platform for developing custom generative AI anywhere, including LLMs, vision language models (VLMs), retrieval models, video models, and speech AI. Beyond large-scale LLM training with Megatron-Core, NeMo Framework provides an extensive set of tools for building state-of-the-art custom generative AI, multimodal, and speech AI agentic systems. To learn more, see the NVIDIA NeMo Framework documentation and the GitHub examples repository.
