The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multinode communication primitives optimized for NVIDIA GPUs and networking. NCCL is a central piece of software for multi-GPU deep learning training. It handles any kind of inter-GPU communication, be it over PCI, NVIDIA NVLink, or networking. It uses advanced topology detection, optimized communication graphs…
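For readers new to the library, here is a minimal single-process sketch of the all-reduce pattern, loosely modeled on the examples in the NCCL documentation. The device cap, buffer size, and omitted error handling are simplifications for illustration, not a definitive implementation.

```cuda
/* Minimal sketch: single-process all-reduce over all local GPUs with NCCL.
 * Error checking is omitted for brevity; sizes are illustrative. */
#include <cuda_runtime.h>
#include <nccl.h>

#define MAX_DEV 8

int main(void) {
    int nDev = 0;
    cudaGetDeviceCount(&nDev);
    if (nDev > MAX_DEV) nDev = MAX_DEV;

    int devs[MAX_DEV];
    ncclComm_t comms[MAX_DEV];
    float *sendbuf[MAX_DEV], *recvbuf[MAX_DEV];
    cudaStream_t streams[MAX_DEV];
    size_t count = 1 << 20;  /* elements per GPU; arbitrary for the example */

    for (int i = 0; i < nDev; ++i) devs[i] = i;
    ncclCommInitAll(comms, nDev, devs);  /* one communicator per local GPU */

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaMalloc((void **)&sendbuf[i], count * sizeof(float));
        cudaMalloc((void **)&recvbuf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    /* Group the per-GPU calls so NCCL can launch them as one collective. */
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaFree(sendbuf[i]);
        cudaFree(recvbuf[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```

The grouped calls matter: without ncclGroupStart/ncclGroupEnd, issuing collectives one GPU at a time from a single thread can deadlock, since each call would wait on peers that have not been launched yet.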
High-performance computing and deep learning workloads are extremely sensitive to latency. Packet loss forces retransmission or stalls in the communication pipeline, which directly increases latency and disrupts the synchronization between GPUs. This can degrade the performance of collective operations such as all-reduce or broadcast, where every GPU's participation is required before progressing.
AI and scientific computing applications are great examples of distributed computing problems. The problems are too large and the computations too intensive to run on a single machine. These computations are broken down into parallel tasks that are distributed across thousands of compute engines, such as CPUs and GPUs. To achieve scalable performance, the system relies on dividing workloads…
For the past few months, the NVIDIA Collective Communications Library (NCCL) developers have been working hard on a set of new library features and bug fixes. In this post, we discuss the details of the NCCL 2.22 release and the pain points addressed. NCCL, part of NVIDIA Magnum IO, is a library designed to optimize inter-GPU and multi-node communication, crucial for efficient parallel computing…
NVSHMEM is a parallel programming interface that provides efficient and scalable communication for NVIDIA GPU clusters. Part of NVIDIA Magnum IO and based on OpenSHMEM, NVSHMEM creates a global address space for data that spans the memory of multiple GPUs and can be accessed with fine-grained GPU-initiated operations, CPU-initiated operations, and operations on CUDA streams.
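To make the GPU-initiated style concrete, here is a minimal sketch following the canonical symmetric "shift" pattern from the NVSHMEM documentation: each processing element (PE) writes its rank into its right neighbor's symmetric buffer with a device-side put. Running it requires an NVSHMEM-aware launcher such as nvshmrun or mpirun; error checking is omitted.

```cuda
/* Minimal NVSHMEM sketch: each PE puts its rank into its neighbor's
 * symmetric buffer using a GPU-initiated operation. */
#include <stdio.h>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void simple_shift(int *destination) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;
    /* Fine-grained GPU-initiated put into the peer's symmetric memory. */
    nvshmem_int_p(destination, mype, peer);
}

int main(void) {
    nvshmem_init();
    /* One GPU per PE on a node, selected by the node-local PE index. */
    int mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
    cudaSetDevice(mype_node);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* Symmetric allocation: the same address is valid on every PE. */
    int *destination = (int *)nvshmem_malloc(sizeof(int));

    simple_shift<<<1, 1, 0, stream>>>(destination);
    nvshmemx_barrier_all_on_stream(stream);  /* order the put before the read */

    int msg;
    cudaMemcpyAsync(&msg, destination, sizeof(int), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    printf("PE %d received message %d\n", nvshmem_my_pe(), msg);

    nvshmem_free(destination);
    nvshmem_finalize();
    return 0;
}
```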
Due to the adoption of multicamera inputs and deep convolutional backbone networks, the GPU memory footprint for training autonomous driving perception models is large. Existing methods for reducing memory usage often result in additional computational overheads or imbalanced workloads. This post describes joint research between NVIDIA and NIO, a developer of smart electric vehicles.
The latest release of CUDA Toolkit, version 12.4, continues to push accelerated computing performance using the latest NVIDIA GPUs. This post explains the new features and enhancements included in this release. CUDA and the CUDA Toolkit software provide the foundation for all NVIDIA GPU-accelerated computing applications in data science and analytics, machine learning…
Traditional cloud data centers have served as the bedrock of computing infrastructure for over a decade, catering to a diverse range of users and applications. However, data centers have evolved in recent years to keep up with advancements in technology and the surging demand for AI-driven computing. This post explores the pivotal role that networking plays in shaping the future of data centers…
Oracle is one of the top cloud service providers in the world, supporting over 22,000 customers and reporting revenue of nearly $4 billion per quarter and annual growth of greater than 40%. Oracle Cloud Infrastructure (OCI) is growing at an even faster rate and offers a complete cloud infrastructure for every workload. Having added 11 regions in the last 18 months, OCI currently offers 41…
Large language models (LLMs) and AI applications such as ChatGPT and DALL-E have recently seen rapid growth. Thanks to GPUs, CPUs, DPUs, high-speed storage, and AI-optimized software innovations, AI is now widely accessible. You can even deploy AI in the cloud or on-premises. Yet AI applications can be very taxing on the network, and this growth is burdening CPU and GPU servers…
We all know that AI is changing the world. For network admins, AI can improve day-to-day operations in some amazing ways. However, AI is no replacement for the know-how of an experienced network admin. AI is meant to augment your capabilities, like a virtual assistant. So, AI may become your best friend, but generative AI is also a new data center workload that brings a new paradigm…
This is the second post in the Accelerating IO series, which describes the architecture, components, and benefits of Magnum IO, the IO subsystem of the modern data center. The first post in this series introduced the Magnum IO architecture and positioned it in the broader context of CUDA, CUDA-X, and vertical application domains. Of the four major components of the architecture…
Imagine using tens of thousands of GPUs to train your neural network. Using multiple GPUs to train neural networks has become quite common with all deep learning frameworks, providing optimized, multi-GPU, and multi-machine training. Allreduce operations, used to sum gradients over multiple GPUs, have usually been implemented using rings [1] [2] to achieve full bandwidth. The downside of rings is…
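The truncated sentence refers to the well-known latency drawback of rings. As a reference point (the standard alpha-beta cost model, not a formula from the post itself): a ring all-reduce of N bytes across k GPUs takes 2(k-1) communication steps, each moving N/k bytes per link, so with per-step latency α and per-link bandwidth B:

```latex
% Standard alpha-beta cost model for a ring all-reduce of N bytes over k GPUs:
% \alpha = per-step latency, B = per-link bandwidth.
T_{\mathrm{ring}}(N, k) \approx 2(k-1)\,\alpha \;+\; \frac{2(k-1)}{k} \cdot \frac{N}{B}
```

The bandwidth term approaches the optimal 2N/B, but the latency term grows linearly with the number of GPUs, which is the scaling limit that tree-based algorithms are designed to address.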
NVIDIA Collective Communications Library (NCCL) provides optimized implementations of inter-GPU communication operations, such as allreduce and variants. Developers using deep learning frameworks can rely on NCCL's highly optimized, MPI-compatible, and topology-aware routines to take full advantage of all available GPUs within and across multiple nodes. NCCL is optimized for high bandwidth and…
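Since this excerpt highlights MPI compatibility, here is a hedged sketch of the common one-rank-per-GPU pattern, in which MPI broadcasts the NCCL unique ID so every rank can join a single communicator. The devices-per-node assumption and the omitted error handling are simplifications for illustration.

```cuda
/* Sketch: one-GPU-per-MPI-rank NCCL initialization and in-place all-reduce.
 * Error checking omitted; device selection is simplified. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <nccl.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Assumes up to 8 GPUs per node for the example; a real application
     * would map devices from the node-local rank. */
    cudaSetDevice(rank % 8);

    /* Rank 0 creates the NCCL ID; MPI distributes it to all ranks. */
    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

    ncclComm_t comm;
    ncclCommInitRank(&comm, nranks, id, rank);

    size_t count = 1 << 20;  /* illustrative gradient buffer size */
    float *buf;
    cudaMalloc((void **)&buf, count * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* In-place sum across all ranks, as a framework would do for gradients. */
    ncclAllReduce(buf, buf, count, ncclFloat, ncclSum, comm, stream);
    cudaStreamSynchronize(stream);

    cudaFree(buf);
    ncclCommDestroy(comm);
    MPI_Finalize();
    return 0;
}
```

Note that MPI is used only for bootstrapping here; the data movement itself goes through NCCL's topology-aware transports.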
At GTC 2017, NVIDIA announced Volta-optimized updates to the NVIDIA Deep Learning SDK. Today, we're making these updates available as free downloads to members of the NVIDIA Developer Program. Deep learning frameworks using NVIDIA cuDNN 7 and NCCL 2 can take advantage of new features and performance benefits of the Volta architecture. Learn more about Volta's Tensor…
Today many servers contain 8 or more GPUs. In principle then, scaling an application from one to many GPUs should provide a tremendous performance boost. But in practice, this benefit can be difficult to obtain. There are two common culprits behind poor multi-GPU scaling. The first is that enough parallelism has not been exposed to efficiently saturate the processors. The second reason for poor…
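As a hypothetical illustration of the first culprit, the sketch below launches independent work on every visible GPU from a single host thread; because kernel launches are asynchronous, all GPUs execute concurrently rather than one at a time. The saxpy kernel and sizes are stand-ins, not code from the post.

```cuda
/* Sketch: keep all GPUs busy by launching asynchronously from one host loop.
 * Buffers are left uninitialized; the point is the concurrency pattern. */
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main(void) {
    int nDev = 0;
    cudaGetDeviceCount(&nDev);
    const int n = 1 << 20;
    float *x[16], *y[16];

    for (int d = 0; d < nDev && d < 16; ++d) {
        cudaSetDevice(d);
        cudaMalloc((void **)&x[d], n * sizeof(float));
        cudaMalloc((void **)&y[d], n * sizeof(float));
        /* Launch without synchronizing: the host loop moves on immediately,
         * so every GPU gets work before any of them finishes. */
        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x[d], y[d]);
    }
    for (int d = 0; d < nDev && d < 16; ++d) {
        cudaSetDevice(d);
        cudaDeviceSynchronize();  /* wait for all GPUs at the end */
        cudaFree(x[d]);
        cudaFree(y[d]);
    }
    return 0;
}
```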