NVSHMEM

NVSHMEM™ is a parallel programming interface based on OpenSHMEM that provides efficient and scalable communication for NVIDIA GPU clusters. NVSHMEM creates a global address space for data that spans the memory of multiple GPUs and can be accessed with fine-grained GPU-initiated operations, CPU-initiated operations, and operations on CUDA® streams.

Existing communication models, such as the Message Passing Interface (MPI), orchestrate data transfers using the CPU. In contrast, NVSHMEM uses asynchronous, GPU-initiated data transfers, eliminating synchronization overheads between the CPU and the GPU.

Efficient, Strong Scaling

NVSHMEM enables long-running kernels that include both communication and computation, reducing overheads that can limit an application's performance when strong scaling.

Low Overhead

One-sided communication primitives reduce overhead by allowing the initiating process or GPU thread to specify all information required to complete a data transfer. This low-overhead model enables many GPU threads to communicate efficiently.
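
The one-sided model described above can be illustrated with a small CPU-only sketch. This is a toy model in plain Python, not the NVSHMEM API: the names `SymmetricHeap`, `put`, and `ring_shift` are hypothetical and exist only for illustration. It assumes each processing element (PE) owns an equally sized buffer, and that the initiator alone names the target PE, the offset, and the data.

```python
# Toy, CPU-only model of one-sided "put" semantics.
# NOT the NVSHMEM API: SymmetricHeap, put, and ring_shift are hypothetical names.


class SymmetricHeap:
    """Every PE (processing element) owns a buffer of the same size."""

    def __init__(self, n_pes, size):
        self.buffers = [[0] * size for _ in range(n_pes)]

    def put(self, dest_pe, offset, values):
        # The initiator supplies everything the transfer needs:
        # target PE, destination offset, and data.
        # The target PE never posts a matching receive.
        self.buffers[dest_pe][offset:offset + len(values)] = values


def ring_shift(n_pes):
    """Each PE writes its own ID into the buffer of the next PE in a ring."""
    heap = SymmetricHeap(n_pes, size=1)
    for my_pe in range(n_pes):
        peer = (my_pe + 1) % n_pes
        heap.put(peer, 0, [my_pe])
    # A real program would synchronize here (for example, with a barrier)
    # before any PE reads the data written into its buffer.
    return [buf[0] for buf in heap.buffers]


print(ring_shift(4))  # [3, 0, 1, 2]
```

Because the target takes no part in the transfer, there is nothing to match on the receiving side, which is one reason this style suits many fine-grained transfers issued by independent threads.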

Naturally Asynchronous

Asynchronous communications make it easier for programmers to interleave computation and communication, thereby increasing overall application performance.

What's New in NVSHMEM 3.2

Added platform support for the Blackwell SM100 architecture on NVLINK5-connected, B200-based systems.
Added one-shot and two-shot NVLINK SHARP (NVLS) allreduce algorithms for half-precision (float16, bfloat16) and full-precision (float32) datatypes on NVLINK4 and NVLINK5 enabled platforms.
Added automatic multi-SM accelerated on-stream collectives (fcollect, reducescatter, reduce) to improve NVLINK bandwidth on NVLINK4 and NVLINK5 enabled platforms, achieving an 8x/16x speedup for medium to large message sizes (>=1MB) compared to prior implementations.
Added a new LLVM IR-compliant bitcode device library to support MLIR-compliant compiler toolchain integration for new and upcoming Python DSLs (Triton, Mosaic, Numba, etc.). This feature also enhances perftest so that the LLVM IR-compliant bitcode device library can be tested using the runtime configuration NVSHMEM_TEST_CUBIN_LIBRARY.
Enhanced NVSHMEM host- and device-side collective and point-to-point operations by introducing a new command-line interface tool that improves runtime tunability of test parameters such as message size, datatype, reduce op, and iterations.
Improved heuristics for the automatic selection of on-stream NVLS collectives for fcollect, reducescatter, and reduce operations that span NVLINK-connected, GPU-based systems.
Removed the dynamic link-time dependency on MPI and SHMEM from perftest and the examples, replacing it with a dynamic load-time capability.
Added a new example of a ring-based allreduce operation for GPUs connected through remote interconnects (IB, RoCE, EFA, etc.).
Added a new example of a fused alltoall and allgather operation (common in Mixture of Experts models) for GPUs connected through P2P interconnects (NVLINK).
Fixed several minor bugs and memory leaks.

Key Features

Combines the memory of multiple GPUs into a partitioned global address space that is accessed through NVSHMEM APIs
Includes a low-overhead, in-kernel communication API for use by GPU threads
Includes stream-based and CPU-initiated communication APIs
Supports x86 and Arm processors
Is interoperable with MPI and other OpenSHMEM implementations

NVSHMEM Advantages

Increase Performance

Convolution is a compute-intensive kernel that is used in a wide variety of applications, including image processing, machine learning, and scientific computing. Spatial parallelization decomposes the domain into sub-partitions that are distributed over multiple GPUs with nearest-neighbor communications, often referred to as halo exchanges.

In the Livermore Big Artificial Neural Network (LBANN) deep learning framework, spatial-parallel convolution is implemented using several communication methods, including MPI and NVSHMEM. The MPI-based halo exchange uses the standard send and receive primitives, whereas the NVSHMEM-based implementation uses one-sided put, yielding significant performance improvements on Lawrence Livermore National Laboratory's Sierra supercomputer.
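
As a rough illustration of spatial parallelization with halo exchange, the following CPU-only sketch splits a 1-D signal into sub-partitions, has each partition write its boundary cells directly into its neighbors' halo slots (mimicking a one-sided put), and then applies a 3-point convolution locally. It is a toy model in plain Python, not LBANN or NVSHMEM code: the function names, the 3-point kernel, and the zero padding at the domain edges are all assumptions made for the example.

```python
# Toy, CPU-only model of spatial-parallel 1-D convolution with halo exchange.
# NOT LBANN or NVSHMEM code: all names and the 3-point kernel are hypothetical.


def convolve_full(x, w):
    """Reference: 3-point convolution with zero padding on one undivided domain."""
    padded = [0.0] + list(x) + [0.0]
    return [w[0] * padded[i] + w[1] * padded[i + 1] + w[2] * padded[i + 2]
            for i in range(len(x))]


def convolve_partitioned(x, w, n_parts):
    """Split x into n_parts sub-partitions, exchange halos, convolve locally."""
    assert len(x) % n_parts == 0, "toy model assumes equal-sized partitions"
    chunk = len(x) // n_parts

    # Each partition stores [left halo | interior cells | right halo].
    # Halos start at zero, which doubles as padding at the domain edges.
    parts = [[0.0] + list(x[p * chunk:(p + 1) * chunk]) + [0.0]
             for p in range(n_parts)]

    # Halo exchange, one-sided style: each partition writes its boundary
    # cells straight into the halo slots of its nearest neighbors.
    for p in range(n_parts):
        if p > 0:
            parts[p - 1][-1] = parts[p][1]   # first interior cell -> left neighbor
        if p < n_parts - 1:
            parts[p + 1][0] = parts[p][-2]   # last interior cell -> right neighbor

    # After the exchange, every partition can convolve its interior on its own.
    out = []
    for local in parts:
        out.extend(w[0] * local[i] + w[1] * local[i + 1] + w[2] * local[i + 2]
                   for i in range(chunk))
    return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
kernel = [0.25, 0.5, 0.25]
print(convolve_partitioned(signal, kernel, 4) == convolve_full(signal, kernel))
```

The only data that crosses partition boundaries is one cell per neighbor, so the partitioned result matches the single-domain result while each partition works almost entirely on local data.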

[Figure: Efficient Strong-Scaling on Sierra Supercomputer (NVSHMEM and MPI)]

[Figure: Efficient Strong-Scaling on NVIDIA DGX SuperPOD (NVSHMEM and MPI)]

Accelerate Time to Solution

Reducing the time to solution for high-performance scientific computing workloads generally requires an application that strong-scales well. QUDA is a library for lattice quantum chromodynamics (QCD) on GPUs, and it is used by the popular MIMD Lattice Computation (MILC) and Chroma codes.

NVSHMEM-enabled QUDA avoids CPU-GPU synchronization for communication, thereby reducing critical-path latencies and significantly improving strong-scaling efficiency.

Watch the GTC 2020 Talk

Simplify Development

The conjugate gradient (CG) method is a popular numerical approach to solving systems of linear equations, and CGSolve is an implementation of this method in the Kokkos programming model. The CGSolve kernel showcases the use of NVSHMEM as a building block for higher-level programming models like Kokkos.

NVSHMEM enables efficient multi-node and multi-GPU execution using Kokkos global array data structures without requiring explicit code for communication between GPUs. As a result, NVSHMEM-enabled Kokkos significantly simplifies development compared to using MPI and CUDA.

[Figure: Productive Programming of Kokkos CGSolve (panels: Computation Code, Communication Code)]

Resources

Introductory Webinar
NVSHMEM Documentation
NVSHMEM Best Practices Guide
NVSHMEM API Documentation
OpenSHMEM Specification
NVSHMEM Developer Forum
For questions or to provide feedback, please contact nvshmem@nvidia.com