cuDSS (Preview) is a GPU-accelerated direct sparse solver. It now supports multi-GPU, multi-node platforms and introduces a hybrid memory mode.
NVIDIA GPUs have enormous compute power and typically must be fed data at high speed to deploy that power. That is possible, in principle, because GPUs also have high memory bandwidth, but sometimes they need your help to saturate that bandwidth. In this post, we examine one specific method to accomplish that: prefetching. We explain the circumstances under which prefetching can be expected…
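A minimal sketch of the prefetching idea under discussion: in a grid-stride loop, each thread issues the load for the next iteration before computing on the current one, so memory latency overlaps with arithmetic. The kernel name and the scaling operation are illustrative, not taken from the post.

__global__ void scale_kernel(const float* __restrict__ in,
                             float* __restrict__ out,
                             float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    if (i < n) {
        float current = in[i];                    // load for the first iteration
        for (; i < n; i += stride) {
            float next = 0.0f;
            if (i + stride < n)
                next = in[i + stride];            // prefetch the next element into a register
            out[i] = current * factor;            // compute while the prefetch is in flight
            current = next;                       // the prefetched value becomes the current one
        }
    }
}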
The NVIDIA Ampere architecture provides new mechanisms to control data movement within the GPU, and CUDA 11.1 puts those controls into your hands. These mechanisms include asynchronously copying data into shared memory and influencing the residency of data in the L2 cache. This post walks through how to use the asynchronous copy feature, and how to set up your algorithms to overlap…
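As a hedged illustration of the asynchronous copy path (the cooperative-groups memcpy_async API available since CUDA 11), the sketch below stages one tile per block from global into shared memory without routing it through registers; the kernel name, tile size, and the reduction it feeds are my own choices, not the post's code.

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

#define TILE 256   // threads per block and elements staged per iteration

__global__ void sum_tiles(const float* __restrict__ in, float* out, int n)
{
    __shared__ float tile[TILE];
    auto block = cg::this_thread_block();

    float acc = 0.0f;
    for (int base = blockIdx.x * TILE; base < n; base += gridDim.x * TILE) {
        int count = min(TILE, n - base);
        // Stage the tile into shared memory asynchronously, bypassing registers.
        cg::memcpy_async(block, tile, in + base, sizeof(float) * count);
        cg::wait(block);                          // wait for the copy to land
        if (threadIdx.x < count)
            acc += tile[threadIdx.x];             // consume the staged data
        block.sync();                             // make the buffer safe to reuse
    }
    atomicAdd(out, acc);                          // each thread adds its partial sum
}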
Histograms are an important data representation with many applications in computer vision, data analytics, and medical imaging. A histogram is a graphical representation of the data distribution across predefined bins. The input data set and the number of bins can vary greatly depending on the domain, so let's focus on one of the most common use cases: an image histogram using 256 bins for each…
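A minimal sketch of the common approach for the 256-bin case, assuming 8-bit input pixels: each block accumulates into a privatized shared-memory histogram with cheap on-chip atomics, then merges its counts into the global bins. Names are illustrative.

#define NUM_BINS 256

__global__ void histogram256(const unsigned char* __restrict__ image,
                             int n, unsigned int* __restrict__ bins)
{
    __shared__ unsigned int local[NUM_BINS];

    // Zero the block-private histogram.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        local[b] = 0;
    __syncthreads();

    // Accumulate into shared memory so atomics stay on chip.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[image[i]], 1u);
    __syncthreads();

    // Merge the private histogram into the global result.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&bins[b], local[b]);
}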
When writing parallel programs, you will often need to communicate values between parallel threads. The typical way to do this in CUDA programming is to use shared memory. But the NVIDIA Kepler GPU architecture introduced a way to directly share data between threads that are part of the same warp. On Kepler, threads of a warp can read each other's registers by using a new instruction called SHFL…
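A short sketch of the idea: with a shuffle, a lane reads another lane's register directly, so a warp-wide sum needs no shared memory at all. Modern CUDA spells the intrinsic __shfl_down_sync with an explicit lane mask; the Kepler-era SHFL instruction described in the post has the same shape without the mask.

__inline__ __device__ float warp_reduce_sum(float val)
{
    // Each step reads a value from the lane 'offset' positions higher in the warp.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;   // lane 0 ends up holding the sum of all 32 lanes
}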
This post is an excerpt from Chapter 4 of the book CUDA Fortran for Scientists and Engineers, by Gregory Ruetsch and Massimiliano Fatica. In this excerpt we extend the matrix transpose example from a previous post to operate on a matrix that is distributed across multiple GPUs. The data layout is shown in Figure 1 for a 1024 × 768 element matrix that is distributed amongst four devices.
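For flavour, here is a sketch (in CUDA C++ rather than the book's CUDA Fortran) of the kind of device-to-device tile copy a distributed transpose relies on; the function and buffer names are illustrative, and error checking is omitted.

#include <cuda_runtime.h>

// Move one tile between two of the devices taking part in the transpose.
void copy_tile_peer(float* dst_tile, int dst_dev,
                    const float* src_tile, int src_dev, size_t tile_bytes)
{
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, dst_dev, src_dev);
    if (can_access) {
        cudaSetDevice(dst_dev);
        cudaDeviceEnablePeerAccess(src_dev, 0);   // direct GPU-to-GPU path
    }
    // cudaMemcpyPeer stages through the host when peer access is unavailable.
    cudaMemcpyPeer(dst_tile, dst_dev, src_tile, src_dev, tile_bytes);
}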
In the previous CUDA C++ post we dove into 3D finite difference computations in CUDA C/C++, demonstrating how to implement the x derivative part of the computation. In this post, let's continue by exploring how we can write efficient kernels for the y and z derivatives. As with the previous post, code for the examples in this post is available for download on GitHub. We can easily modify the…
In the last CUDA Fortran post we dove into 3D finite difference computations in CUDA Fortran, demonstrating how to implement the x derivative part of the computation. In this post, let's continue by exploring how we can write efficient kernels for the y and z derivatives. As with the previous post, code for the examples in this post is available for download on GitHub. We can easily modify…
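As a rough sketch of what such a y-derivative kernel looks like for a single z-plane, the block below stages a tile with a one-point halo in shared memory so neighbouring y values are reused on chip. The posts use an 8th-order stencil and pencil-shaped tiles; this simplified version uses 2nd-order central differences and assumes nx and ny are multiples of the tile size.

#define TILE_X 32
#define TILE_Y 32

__global__ void deriv_y(const float* __restrict__ f, float* __restrict__ dfdy,
                        int nx, int ny, float inv_2dy)
{
    __shared__ float tile[TILE_Y + 2][TILE_X];   // one-point halo in y

    int i  = blockIdx.x * TILE_X + threadIdx.x;  // global x index
    int j  = blockIdx.y * TILE_Y + threadIdx.y;  // global y index
    int sj = threadIdx.y + 1;                    // y index within the tile

    tile[sj][threadIdx.x] = f[j * nx + i];
    if (threadIdx.y == 0 && j > 0)               // lower halo row
        tile[0][threadIdx.x] = f[(j - 1) * nx + i];
    if (threadIdx.y == TILE_Y - 1 && j < ny - 1) // upper halo row
        tile[TILE_Y + 1][threadIdx.x] = f[(j + 1) * nx + i];
    __syncthreads();

    if (j > 0 && j < ny - 1)                     // interior points only
        dfdy[j * nx + i] = (tile[sj + 1][threadIdx.x] -
                            tile[sj - 1][threadIdx.x]) * inv_2dy;
}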
In the previous CUDA C/C++ post we investigated how we can use shared memory to optimize a matrix transpose, achieving roughly an order of magnitude improvement in effective bandwidth by using shared memory to coalesce global memory access. The topic of today's post is to show how to use shared memory to enhance data reuse in a finite difference code. In addition to shared memory…
In the last CUDA Fortran post we investigated how shared memory can be used to optimize a matrix transpose, achieving roughly an order of magnitude improvement in effective bandwidth by using shared memory to coalesce global memory access. The topic of today's post is to show how to use shared memory to enhance data reuse in a finite difference code. In addition to shared memory…
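The sketch below shows the data-reuse idea in its simplest form for the x derivative: one block loads a whole x-pencil into shared memory with coalesced reads, and every thread then takes its stencil neighbours from on-chip memory. The pencil length, kernel name, and the 2nd-order stencil are simplifications; the posts use an 8th-order stencil.

#define NX 64   // pencil length; launch with NX threads per block, one block per (y, z) line

__global__ void deriv_x(const float* __restrict__ f, float* __restrict__ dfdx,
                        float inv_2dx)
{
    __shared__ float pencil[NX];

    int i   = threadIdx.x;                       // position along the pencil
    int row = blockIdx.x;                        // which (y, z) line this block owns
    const float* line = f + row * NX;

    pencil[i] = line[i];                         // one coalesced load per thread
    __syncthreads();

    if (i > 0 && i < NX - 1)                     // interior points only
        dfdx[row * NX + i] = (pencil[i + 1] - pencil[i - 1]) * inv_2dx;
}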
My last CUDA C++ post covered the mechanics of using shared memory, including static and dynamic allocation. In this post I will show some of the performance gains achievable using shared memory. Specifically, I will optimize a matrix transpose to show how to use shared memory to reorder strided global memory accesses into coalesced accesses. The code we wish to optimize is a transpose of a…
My previous CUDA Fortran post covered the mechanics of using shared memory, including static and dynamic allocation. In this post I will show some of the performance gains achievable using shared memory. Specifically, I will optimize a matrix transpose to show how to use shared memory to reorder strided global memory accesses into coalesced accesses. The code we wish to optimize is a transpose…
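The pattern being optimized looks roughly like the sketch below: a 32 × 32 tile is read into shared memory with coalesced loads and written back transposed so that the stores coalesce too, with one element of padding to avoid shared-memory bank conflicts. This version assumes a square matrix whose width is a multiple of the tile size and a 32 × 8 thread block.

#define TILE_DIM   32
#define BLOCK_ROWS 8

__global__ void transpose_shared(const float* __restrict__ in,
                                 float* __restrict__ out, int width)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // +1 pad avoids bank conflicts

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    for (int r = 0; r < TILE_DIM; r += BLOCK_ROWS)
        tile[threadIdx.y + r][threadIdx.x] = in[(y + r) * width + x];   // coalesced reads
    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;         // swap block indices for the output
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    for (int r = 0; r < TILE_DIM; r += BLOCK_ROWS)
        out[(y + r) * width + x] = tile[threadIdx.x][threadIdx.y + r];  // coalesced writes
}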
In the previous post, I looked at how global memory accesses by a group of threads can be coalesced into a single transaction, and how alignment and stride affect coalescing for various generations of CUDA hardware. For recent versions of CUDA hardware, misaligned data accesses are not a big issue. However, striding through global memory is problematic regardless of the generation of the CUDA…
In the previous post, I looked at how global memory accesses by a group of threads can be coalesced into a single transaction, and how alignment and stride affect coalescing for various generations of CUDA hardware. For recent versions of CUDA hardware, misaligned data accesses are not a big issue. However, striding through global memory is problematic regardless of the generation of…
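A minimal sketch of the kind of strided-access kernel such measurements use: with stride 1 a warp's accesses fall in one contiguous segment and coalesce; as the stride grows, each warp touches more memory segments and effective bandwidth drops. Names are illustrative, and both arrays must hold at least gridDim.x * blockDim.x * stride elements.

__global__ void stride_copy(float* __restrict__ out,
                            const float* __restrict__ in, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];   // stride == 1 gives fully coalesced accesses
}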