Networking Reliability and Observability at Scale with NCCL 2.24
Kaiming Ouyang, NVIDIA Technical Blog
Published 2025-03-13T16:30:00Z, updated 2025-04-23T00:32:27Z
http://www.open-lab.net/blog/?p=96731

The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multinode (MGMN) communication primitives optimized for NVIDIA GPUs and networking. NCCL is a central piece of software for multi-GPU deep learning training. It handles any kind of inter-GPU communication, be it over PCI, NVLink, or networking. It uses advanced topology detection, optimized communication graphs…
