In high-stakes fields such as quant finance, algorithmic trading, and fraud detection, data practitioners frequently need to process hundreds of gigabytes (GB) of data to make quick, informed decisions. Polars, one of the fastest-growing data processing libraries, meets this need with a GPU engine powered by NVIDIA cuDF that accelerates compute-bound queries that are common in these fields.
NVIDIA cuPyNumeric is a library that aims to provide a distributed and accelerated drop-in replacement for NumPy built on top of the Legate framework. It brings zero-code-change scaling to multi-GPU and multinode (MGMN) accelerated computing. cuPyNumeric 25.03 is a milestone update that introduces powerful new capabilities and enhanced accessibility for users and developers alike…
With the rise of physical AI, video content generation has surged exponentially. A single camera-equipped autonomous vehicle can generate more than 1 TB of video daily, while a robotics-powered manufacturing facility may produce 1 PB of data daily. To leverage this data for training and fine-tuning world foundation models (WFMs), you must first process it efficiently.
Classifier models specialize in categorizing data into predefined groups or classes and play a crucial role in optimizing data processing pipelines for fine-tuning and pretraining generative AI models. Their value lies in enhancing data quality by filtering out low-quality or toxic data, ensuring that only clean and relevant information feeds downstream processes. Beyond filtering…
Quantum dynamics describes how complex quantum systems evolve in time and interact with their surroundings. Simulating quantum dynamics is extremely difficult yet critical for understanding and predicting the fundamental properties of materials. This is of particular importance in the development of quantum processing units (QPUs), where quantum dynamics simulations enable QPU developers to…
NVIDIA has continually advanced high-performance computing (HPC) by offering its highly optimized High-Performance Conjugate Gradient (HPCG) benchmark as part of the NVIDIA HPC benchmark collection. The NVIDIA HPCG benchmark is now available in the /NVIDIA/nvidia-hpcg GitHub repo, using its high-performance math libraries, cuSPARSE…
Large language models (LLMs) are getting larger, increasing the amount of compute required to process inference requests. To meet real-time latency requirements for serving today's LLMs, and to do so for as many users as possible, multi-GPU compute is a must. Low latency improves the user experience. High throughput reduces the cost of service. Both are simultaneously important. Even if a large…
Due to the adoption of multicamera inputs and deep convolutional backbone networks, the GPU memory footprint for training autonomous driving perception models is large. Existing methods for reducing memory usage often result in additional computational overheads or imbalanced workloads. This post describes joint research between NVIDIA and NIO, a developer of smart electric vehicles.
NVIDIA Holoscan for Media is a software-defined platform for building and deploying applications for live media. Recent updates introduce a user-friendly developer interface and new capabilities for application deployment to the platform. Holoscan for Media now includes Helm Dashboard, which delivers an intuitive user interface for orchestrating and managing Helm charts.
This NVIDIA HPC SDK 23.9 update expands platform support and provides minor updates.
The broadcast industry is undergoing a transformation in how content is created, managed, distributed, and consumed. This transformation includes a shift from traditional linear workflows bound by fixed-function devices to flexible and hybrid, software-defined systems that enable the future of live streaming. Developers can now apply to join the early access program for NVIDIA Holoscan for…
As data scientists, we often face the challenging task of training large models on huge datasets. One commonly used tool, XGBoost, is a robust and efficient gradient-boosting framework that's been widely adopted due to its speed and performance for large tabular data. Using multiple GPUs should theoretically provide a significant boost in computational power, resulting in faster model…
Generative AI has become a transformative force of our era, empowering organizations spanning every industry to achieve unparalleled levels of productivity, elevate customer experiences, and deliver superior operational efficiencies. Large language models (LLMs) are the brains behind generative AI. Access to incredibly powerful and knowledgeable foundation models, like Llama and Falcon…
When you see a context-relevant advertisement on a web page, it's most likely content served by a Taboola data pipeline. As the leading content recommendation company in the world, Taboola faced a big challenge: the frequent need to scale Apache Spark CPU cluster capacity to address the constantly growing compute and storage requirements. Data center capacity and hardware costs are always…
This update expands platform support and provides minor updates.
Real-time cloud-scale applications that involve AI-based computer vision are growing rapidly. The use cases include image understanding, content creation, content moderation, mapping, recommender systems, and video conferencing. However, the compute cost of these workloads is growing too, driven by demand for increased sophistication in the processing. The shift from still images to video is…
GPUs continue to get faster with each new generation, and it is often the case that each activity on the GPU (such as a kernel or memory copy) completes very quickly. In the past, each activity had to be separately scheduled (launched) by the CPU, and associated overheads could accumulate to become a performance bottleneck. The CUDA Graphs facility addresses this problem by enabling multiple GPU…
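As a rough illustration of the pattern (a sketch, not the post's code), the example below captures a sequence of short kernel launches into a graph with the CUDA stream-capture API and then replays the whole sequence with one launch per iteration; the `scale` kernel and the sizes are made up for the example.

```cpp
#include <cuda_runtime.h>

// Trivial kernel standing in for short, frequently launched work.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture ten small launches into a graph instead of paying the CPU
    // launch overhead for each of them on every iteration.
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int k = 0; k < 10; ++k) {
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d_x, 1.001f, n);
    }
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graphExec, graph, 0);  // CUDA 12 signature; older
                                                 // toolkits take (&exec, graph,
                                                 // nullptr, nullptr, 0)

    // Replay the entire 10-kernel sequence with a single launch call.
    for (int iter = 0; iter < 1000; ++iter) {
        cudaGraphLaunch(graphExec, stream);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_x);
    return 0;
}
```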
CO2 capture and storage technologies (CCS) catch CO2 from its production source, compress it, transport it through pipelines or by ships, and store it underground. CCS enables industries to massively reduce their CO2 emissions and is a powerful tool to help industrial manufacturers achieve net-zero goals. In many heavy industrial processes, greenhouse gas (GHG) emissions cannot be avoided in the…
Version 23.3 expands platform support and provides minor updates to the NVIDIA HPC SDK.
Version 23.1 of the NVIDIA HPC SDK introduces CUDA 12 support, fixes, and minor enhancements.
Learn the basic concepts, implementations, and applications of graph neural networks (GNNs) in this new self-paced course from NVIDIA Deep Learning Institute.
Learn how to decrease model training time by distributing data to multiple GPUs, while retaining the accuracy of training on a single GPU.
Celebrating the Supercomputing 2022 (SC22) international conference, NVIDIA announces the release of the HPC Software Development Kit (SDK) v22.11. Members of the NVIDIA Developer Program can download the release now for free. The NVIDIA HPC SDK is a comprehensive suite of compilers, libraries, and tools for high-performance computing (HPC) developers. It provides everything developers need to…
Learn how to train the largest of deep neural networks and deploy them to production.
Organizations are rapidly becoming more advanced in the use of AI, and many are looking to leverage the latest technologies to maximize workload performance and efficiency. One of the most prevalent trends today is the use of CPUs based on Arm architecture to build data center servers. To ensure that these new systems are enterprise-ready and optimally configured, NVIDIA has approved the…
This is the second part of a two-part series about NVIDIA tools that allow you to run large transformer models for accelerated inference. For an introduction to the FasterTransformer library (Part 1), see Accelerated Inference for Large Transformer Models Using NVIDIA Triton Inference Server. Join the NVIDIA Triton and NVIDIA TensorRT community to stay current on the latest product updates…
Standard languages have begun adding features that compilers can use for accelerated GPU and CPU parallel programming, for instance, loops and array math intrinsics in Fortran. This is the fourth post in the Standard Parallel Programming series, which aims to instruct developers on the advantages of using parallelism in standard languages for accelerated computing: Using standard…
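As a flavor of the approach (a sketch, not code from the series), here is SAXPY written against the C++17 parallel algorithms only; compiled with nvc++ and the -stdpar=gpu flag the parallel loop can be offloaded to a GPU, while any conforming host compiler runs it as CPU code.

```cpp
#include <algorithm>
#include <execution>
#include <vector>

// SAXPY with no CUDA API calls in the source: the execution policy is the
// only hint that the loop may run in parallel (and, with nvc++ -stdpar=gpu,
// on the GPU).
int main() {
    const std::size_t n = 1 << 20;
    const float a = 2.0f;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(), y.begin(), y.begin(),
                   [a](float xi, float yi) { return a * xi + yi; });

    return 0;
}
```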
High-performance computing (HPC) has become the essential instrument of scientific discovery. Whether it is discovering new, life-saving drugs, battling climate change, or creating accurate simulations of our world, these solutions demand an enormous and rapidly growing amount of processing power. They are increasingly out of reach of traditional computing approaches.
It may seem natural to expect that the performance of your CPU-to-GPU port will fall below that of a dedicated HPC code. After all, you are limited by the constraints of the software architecture, the established API, and the need to account for sophisticated extra features expected by the user base. Not only that, the simplistic programming model of C++ standard parallelism allows for less…
The difficulty of porting an application to GPUs varies from one case to another. In the best-case scenario, you can accelerate critical code sections by calling into an existing GPU-optimized library. This is the case, for example, when the building blocks of your simulation software consist of BLAS linear algebra functions, which can be accelerated using cuBLAS. This is the second post in the…
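For instance, a minimal sketch of replacing a CPU GEMM with a cuBLAS call might look like the following (the matrix sizes and data are placeholders, and error checking is omitted for brevity):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Single-precision GEMM on the GPU: C = alpha*A*B + beta*C.
// cuBLAS expects column-major storage, matching Fortran BLAS conventions.
// Build with: nvcc gemm.cu -lcublas
int main() {
    const int n = 512;  // square matrices for brevity
    std::vector<float> hA(n * n, 1.0f), hB(n * n, 2.0f), hC(n * n, 0.0f);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dB, n * n * sizeof(float));
    cudaMalloc(&dC, n * n * sizeof(float));
    cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(hC.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```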
Today, NVIDIA announces the release of cuFFTMp for Early Access (EA). cuFFTMp is a multi-node, multi-process extension to cuFFT that enables scientists and engineers to solve challenging problems on exascale platforms. FFTs (Fast Fourier Transforms) are widely used in a variety of fields, ranging from molecular dynamics, signal processing, computational fluid dynamics (CFD) to wireless…
Single-cell genomics research continues to advance drug discovery for disease prevention. For example, it has been pivotal in developing treatments for the current COVID-19 pandemic, identifying cells susceptible to infection, and revealing changes in the immune systems of infected patients. However, with the growing availability of large-scale single-cell datasets, it's clear that computing…
Many GPU-accelerated HPC applications spend a substantial portion of their time in non-uniform, GPU-to-GPU communications. Additionally, in many HPC systems, different GPU pairs share communication links with varying bandwidth and latency. As a result, GPU assignment can substantially impact time to solution. Furthermore, on multi-node/multi-socket systems, communication performance can degrade…
This is the second post in the Accelerating IO series, which describes the architecture, components, and benefits of Magnum IO, the IO subsystem of the modern data center. The first post in this series introduced the Magnum IO architecture and positioned it in the broader context of CUDA, CUDA-X, and vertical application domains. Of the four major components of the architecture…
This is the first post in the Accelerating IO series, which describes the architecture, components, storage, and benefits of Magnum IO, the IO subsystem of the modern data center. Previously the boundary of the unit of computing, sheet metal no longer constrains the resources that can be applied to a single problem or the data set that can be housed. The new unit is the data center.
With the introduction of hardware-accelerated ray tracing in NVIDIA Turing GPUs, game developers have received an opportunity to significantly improve the realism of gameplay rendered in real time. In these early days of real-time ray tracing, most implementations are limited to one or two ray-tracing effects, such as shadows or reflections. If the game's geometry and materials are simple enough…
Deep neural network (DNN) development for self-driving cars is a demanding workload. In this post, we validate DGX multi-node, multi-GPU, distributed training running on RedHat OpenShift in the DXC Robotic Drive environment. We used OpenShift 3.11, also a part of the Robotic Drive containerized compute platform, to orchestrate and execute the deep learning (DL) workloads.
Gone are the days of using a single GPU to train a deep learning model. With computationally intensive algorithms such as semantic segmentation, a single GPU can take days to optimize a model. But multi-GPU hardware is expensive, you say. Not any longer; NVIDIA multi-GPU hardware on cloud instances like the AWS P3 allows you to pay for only what you use. Cloud instances allow you to take…
Today many servers contain 8 or more GPUs. In principle then, scaling an application from one to many GPUs should provide a tremendous performance boost. But in practice, this benefit can be difficult to obtain. There are two common culprits behind poor multi-GPU scaling. The first is that enough parallelism has not been exposed to efficiently saturate the processors. The second reason for poor…
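A minimal sketch of the first point, exposing enough parallelism by giving every visible GPU its own chunk of work and its own stream, might look like this (the `scale` kernel and sizes are illustrative):

```cpp
#include <cuda_runtime.h>
#include <vector>

__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Give each GPU its own buffer and stream, and launch work on every device
// before synchronizing, so all GPUs compute at the same time.
int main() {
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);

    const int n_per_gpu = 1 << 24;
    std::vector<float*> dbuf(ngpus);
    std::vector<cudaStream_t> streams(ngpus);

    for (int g = 0; g < ngpus; ++g) {
        cudaSetDevice(g);  // subsequent calls target GPU g
        cudaMalloc(&dbuf[g], n_per_gpu * sizeof(float));
        cudaStreamCreate(&streams[g]);
        scale<<<(n_per_gpu + 255) / 256, 256, 0, streams[g]>>>(dbuf[g], 2.0f, n_per_gpu);
    }

    // Only now wait; the launches above were asynchronous, so the GPUs ran concurrently.
    for (int g = 0; g < ngpus; ++g) {
        cudaSetDevice(g);
        cudaStreamSynchronize(streams[g]);
        cudaStreamDestroy(streams[g]);
        cudaFree(dbuf[g]);
    }
    return 0;
}
```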
NVIDIA GPUDirect RDMA is a technology which enables a direct path for data exchange between the GPU and third-party peer devices using standard features of PCI Express. Examples of third-party devices include network interfaces, video acquisition devices, storage adapters, and medical equipment. Enabled on Tesla and Quadro-class GPUs, GPUDirect RDMA relies on the ability of NVIDIA GPUs to expose…
Some of the fastest computers in the world are cluster computers. A cluster is a computer system comprising two or more computers ("nodes") connected with a high-speed network. Cluster computers can achieve higher availability, reliability, and scalability than is possible with an individual computer. With the increasing adoption of GPUs in high-performance computing (HPC)…
I introduced CUDA-aware MPI in my last post, with an introduction to MPI and a description of the functionality and benefits of CUDA-aware MPI. In this post I will demonstrate the performance of CUDA-aware MPI through both synthetic and realistic benchmarks. Since you now know why CUDA-aware MPI is more efficient from a theoretical perspective, let's take a look at the results of MPI bandwidth and…
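A synthetic bandwidth test of this kind can be sketched as below (not the post's benchmark code): rank 0 repeatedly sends a device buffer to rank 1 and divides the bytes moved by the elapsed time. A CUDA-aware MPI build is assumed, so the cudaMalloc'd pointer is passed straight to MPI_Send and MPI_Recv; run with two ranks (for example, mpirun -np 2).

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include <mpi.h>

// Crude point-to-point bandwidth test between rank 0 and rank 1 using
// device buffers directly (requires a CUDA-aware MPI build).
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nbytes = 1 << 26;  // 64 MiB message
    const int iters = 100;
    char *d_buf;
    cudaMalloc(&d_buf, nbytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; ++i) {
        if (rank == 0)
            MPI_Send(d_buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("approx. bandwidth: %.2f GB/s\n",
               (double)nbytes * iters / (t1 - t0) / 1e9);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```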
MPI, the Message Passing Interface, is a standard API for communicating data via messages between distributed processes that is commonly used in HPC to build applications that can scale to multi-node computer clusters. MPI is also fully compatible with CUDA, which is designed for parallel computing on a single computer or node. There are many reasons for wanting to combine the two parallel…
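As a minimal sketch of what a CUDA-aware MPI permits (assuming an MPI library built with CUDA support), the example below exchanges GPU buffers between neighboring ranks in a ring without staging the data through host memory:

```cpp
#include <cuda_runtime.h>
#include <mpi.h>

// With a CUDA-aware MPI, device pointers can be passed directly to MPI
// calls; without it, the application must first copy data to host memory.
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;
    float *d_send, *d_recv;
    cudaMalloc(&d_send, n * sizeof(float));
    cudaMalloc(&d_recv, n * sizeof(float));

    // Exchange GPU buffers with ring neighbors; no cudaMemcpy to the host
    // is needed because the MPI library understands device pointers.
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;
    MPI_Sendrecv(d_send, n, MPI_FLOAT, right, 0,
                 d_recv, n, MPI_FLOAT, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}
```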