
    NVSHMEM™ is a parallel programming interface based on OpenSHMEM that provides efficient and scalable communication for NVIDIA GPU clusters. NVSHMEM creates a global address space for data that spans the memory of multiple GPUs and can be accessed with fine-grained GPU-initiated operations, CPU-initiated operations, and operations on CUDA® streams.


    Existing communication models, such as the Message Passing Interface (MPI), orchestrate data transfers using the CPU. In contrast, NVSHMEM uses asynchronous, GPU-initiated data transfers, eliminating synchronization overheads between the CPU and the GPU.

    Efficient, Strong Scaling

    NVSHMEM enables long-running kernels that include both communication and computation, reducing overheads that can limit an application’s performance when strong scaling.
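
    To make the strong-scaling argument concrete, the following sketch computes strong-scaling efficiency (speedup over one GPU divided by the GPU count) under a deliberately simple cost model. The model, the function names, and the numbers are assumptions for illustration only; they are not NVSHMEM measurements.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical cost model: time on several GPUs is the evenly divided work
// plus a fixed overhead (for example, CPU-GPU synchronization) that does not
// shrink as GPUs are added. A single GPU is assumed to pay no such overhead.
double run_time(double work, double overhead, int gpus) {
    return gpus == 1 ? work : work / gpus + overhead;
}

// Strong-scaling efficiency: speedup over one GPU divided by the GPU count.
double strong_scaling_efficiency(double work, double overhead, int gpus) {
    double speedup = run_time(work, overhead, 1) / run_time(work, overhead, gpus);
    return speedup / gpus;
}
```

    Under this model, with 100 units of work and 5 units of fixed overhead, efficiency falls from about 0.83 on 4 GPUs to about 0.56 on 16 GPUs, which is why shrinking the fixed overhead matters more as the GPU count grows.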

    Low Overhead

    One-sided communication primitives reduce overhead by allowing the initiating process or GPU thread to specify all information required to complete a data transfer. This low-overhead model enables many GPU threads to communicate efficiently.
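
    The one-sided model can be sketched with a small host-only C++ simulation. This is not the NVSHMEM API: `SymmetricInts`, `put_int`, and `ring_shift` are hypothetical names, and the vectors stand in for a symmetric allocation in which every PE (processing element) owns an identically laid out buffer. In a real NVSHMEM program, the corresponding steps would use calls such as `nvshmem_malloc` and `nvshmem_int_p`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only stand-in for a symmetric allocation: every PE owns a buffer of
// the same length, so (offset, PE) identifies one remote location.
struct SymmetricInts {
    std::vector<std::vector<int>> per_pe;  // per_pe[pe][offset]
    SymmetricInts(int n_pes, std::size_t count)
        : per_pe(n_pes, std::vector<int>(count, 0)) {}
};

// One-sided put: the initiator supplies the value, the destination offset,
// and the target PE. The target posts no matching receive.
inline void put_int(SymmetricInts& heap, std::size_t offset, int value, int target_pe) {
    heap.per_pe.at(target_pe).at(offset) = value;
}

// Ring shift: each PE writes its own rank into its right-hand neighbor's slot.
// Returns the value that each PE ends up holding.
inline std::vector<int> ring_shift(int n_pes) {
    SymmetricInts heap(n_pes, 1);
    for (int me = 0; me < n_pes; ++me) {
        int peer = (me + 1) % n_pes;
        put_int(heap, 0, me, peer);
    }
    std::vector<int> received;
    for (int pe = 0; pe < n_pes; ++pe) {
        received.push_back(heap.per_pe[pe][0]);
    }
    return received;
}
```

    With four PEs, PE 0 ends up holding the rank of PE 3, and every other PE holds the rank of its left-hand neighbor. The point of the sketch is that `put_int` carries everything needed to complete the transfer, so the receiving side runs no code.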

    Naturally Asynchronous

    Asynchronous communications make it easier for programmers to interleave computation and communication, thereby increasing overall application performance.

    What's New in NVSHMEM 2.9.0

    • Improvements to the CMake build system. CMake is now the default build system and the Makefile build system is deprecated.
    • Added loadable network transport modules.
    • NVSHMEM device code can now be inlined to improve performance by enabling NVSHMEM_ENABLE_ALL_DEVICE_INLINING when building the NVSHMEM library.
    • Improvements to collective communication performance.
    • Updated libfabric transport to fragment messages larger than the maximum length supported by the provider.
    • Improvements to the IBGDA transport, including large message support; user buffer registration; better performance for blocking g, get, and AMO (atomic memory operation) calls; CUDA module support; and several bug fixes.
    • Introduced ABI compatibility for bootstrap modules. This release is backward compatible with the ABI introduced in NVSHMEM 2.8.0.
    • Added NVSHMEM_BOOTSTRAP_*_PLUGIN environment variables that can be used to override the default filename used when opening each bootstrap plugin.
    • Improved error handling for GDRCopy.
    • Added a check to detect when the number of PEs differs across nodes.
    • Added a check to detect the availability of the nvidia_peermem kernel module.
    • Reduced internal stream synchronizations to fix a compatibility bug with CUDA graph capture.

    Key Features

    • Combines the memory of multiple GPUs into a partitioned global address space that’s accessed through NVSHMEM APIs
    • Includes a low-overhead, in-kernel communication API for use by GPU threads
    • Includes stream-based and CPU-initiated communication APIs
    • Supports x86 and POWER9 processors
    • Is interoperable with MPI and other OpenSHMEM implementations

    NVSHMEM Advantages

    Increase Performance

    Convolution is a compute-intensive kernel that’s used in a wide variety of applications, including image processing, machine learning, and scientific computing. Spatial parallelization decomposes the domain into sub-partitions that are distributed over multiple GPUs with nearest-neighbor communications, often referred to as halo exchanges.

    In the Livermore Big Artificial Neural Network (LBANN) deep learning framework, spatial-parallel convolution is implemented using several communication methods, including MPI and NVSHMEM. The MPI-based halo exchange uses the standard send and receive primitives, whereas the NVSHMEM-based implementation uses one-sided put, yielding significant performance improvements on Lawrence Livermore National Laboratory’s Sierra supercomputer.
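
    A halo exchange built from one-sided puts can be sketched as a host-only C++ model. This is not LBANN's implementation and not the NVSHMEM API: `Slab`, `make_slabs`, and `halo_exchange` are hypothetical names, the domain is one-dimensional with non-periodic boundaries, and all slabs are assumed to have the same size. A real NVSHMEM version would issue the writes with put calls such as `nvshmem_double_put` and would synchronize (for example with `nvshmem_barrier_all`) before reading the halo cells.

```cpp
#include <cassert>
#include <vector>

// One PE's slab: index 0 and index n + 1 are halo cells, 1..n are interior.
using Slab = std::vector<double>;

// Build one slab per PE and fill interior cells with their global cell index.
std::vector<Slab> make_slabs(int n_pes, int interior) {
    std::vector<Slab> slabs(n_pes, Slab(interior + 2, 0.0));
    for (int pe = 0; pe < n_pes; ++pe) {
        for (int i = 1; i <= interior; ++i) {
            slabs[pe][i] = pe * interior + (i - 1);
        }
    }
    return slabs;
}

// Each PE writes its edge cells directly into its neighbors' halo cells.
// Only interior cells are read and only halo cells are written, so the
// order in which PEs perform their puts does not affect the result.
void halo_exchange(std::vector<Slab>& slabs) {
    const int n_pes = static_cast<int>(slabs.size());
    for (int me = 0; me < n_pes; ++me) {
        const int n = static_cast<int>(slabs[me].size()) - 2;
        if (me > 0) {
            slabs[me - 1][n + 1] = slabs[me][1];  // put into left neighbor's right halo
        }
        if (me + 1 < n_pes) {
            slabs[me + 1][0] = slabs[me][n];      // put into right neighbor's left halo
        }
    }
}
```

    With three PEs and four interior cells each, the middle PE ends up with 3 in its left halo and 8 in its right halo, while the outer halos of the first and last PEs stay untouched. Unlike a send and receive pair, no PE has to post a matching operation for the data it receives.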

    Efficient Strong-Scaling on Sierra Supercomputer

    Efficient Strong-Scaling on NVIDIA DGX SuperPOD

    Accelerate Time to Solution

    Reducing the time to solution for high-performance scientific computing workloads generally requires an application that strong-scales well. QUDA is a library for lattice quantum chromodynamics (QCD) on GPUs, and it’s used by the popular MIMD Lattice Computation (MILC) and Chroma codes.

    NVSHMEM-enabled QUDA avoids CPU-GPU synchronization for communication, thereby reducing critical-path latencies and significantly improving strong-scaling efficiency.

    Watch the GTC 2020 Talk

    Simplify Development

    The conjugate gradient (CG) method is a popular numerical approach to solving systems of linear equations, and CGSolve is an implementation of this method in the Kokkos programming model. The CGSolve kernel showcases the use of NVSHMEM as a building block for higher-level programming models like Kokkos.

    NVSHMEM enables efficient multi-node and multi-GPU execution using Kokkos global array data structures without requiring explicit code for communication between GPUs. As a result, NVSHMEM-enabled Kokkos significantly simplifies development compared to using MPI and CUDA.
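
    The idea of a global array that hides communication can be sketched with a host-only C++ model. This is not the Kokkos API: `GlobalArray` and `dot` are hypothetical names, and the block distribution (equal-sized contiguous chunks per PE) is an assumption. The caller indexes by global position, and the array works out which PE owns the element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Block-distributed array: element i lives on PE (i / per_pe) at local
// index (i % per_pe). Callers use global indices only.
class GlobalArray {
public:
    GlobalArray(int n_pes, std::size_t per_pe)
        : per_pe_(per_pe), parts_(n_pes, std::vector<double>(per_pe, 0.0)) {}

    std::size_t size() const { return per_pe_ * parts_.size(); }
    int owner(std::size_t i) const { return static_cast<int>(i / per_pe_); }
    std::size_t local_index(std::size_t i) const { return i % per_pe_; }

    // Element access; in a distributed setting this is where a remote get
    // or put would be issued when the owner is another PE.
    double& operator()(std::size_t i) { return parts_.at(owner(i)).at(local_index(i)); }
    const double& operator()(std::size_t i) const { return parts_.at(owner(i)).at(local_index(i)); }

private:
    std::size_t per_pe_;
    std::vector<std::vector<double>> parts_;
};

// Dot product, one of the building blocks of a CG iteration, written
// without any explicit communication code. Assumes x and y have equal size.
double dot(const GlobalArray& x, const GlobalArray& y) {
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum += x(i) * y(i);
    }
    return sum;
}
```

    The `dot` function reads like single-GPU code because ownership is resolved inside `operator()`, which is the kind of simplification the paragraph above describes.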

    Productive Programming of Kokkos CGSolve

    Ready to start developing with NVSHMEM?

    Get Started