parallel programming – NVIDIA Technical Blog

parallel programming – NVIDIA Technical Blog News and tutorials for developers, data scientists, and IT admins 2025-06-05T19:36:16Z http://www.open-lab.net/blog/feed/ Graham Lopez <![CDATA[Just Released: NVIDIA HPC SDK v25.3]]> http://www.open-lab.net/blog/?p=98646 2025-04-17T19:35:30Z 2025-04-10T20:20:32Z

The HPC SDK v25.3 release includes support for NVIDIA Blackwell GPUs and an optimized allocator for Arm CPUs.

]]> 0 Ioana Boier <![CDATA[Profit and Loss Modeling on GPUs with ISO C++ Language Parallelism]]> http://www.open-lab.net/blog/?p=85106 2024-08-22T18:25:37Z 2024-08-07T16:30:00Z

The previous post How to Accelerate Quantitative Finance with ISO C++ Standard Parallelism demonstrated how to write a Black-Scholes simulation using ISO C++ standard parallelism with the code found in the /NVIDIA/accelerated-quant-finance GitHub repo. This approach enables you to productively write code that is both concise and portable. Using solely standard C++, it's possible to write an…

]]> 0 Ioana Boier <![CDATA[How to Accelerate Quantitative Finance with ISO C++ Standard Parallelism]]> http://www.open-lab.net/blog/?p=78691 2024-04-09T23:45:35Z 2024-03-06T19:00:00Z

Quantitative finance libraries are software packages that consist of mathematical, statistical, and, more recently, machine learning models designed for use in quantitative investment contexts. They contain a wide range of functionalities, often proprietary, to support the valuation, risk management, construction, and optimization of investment portfolios. Financial firms that develop such…

]]> 1 Jay Gould <![CDATA[Just Released: NVIDIA HPC SDK v24.1]]> http://www.open-lab.net/blog/?p=77283 2024-02-22T19:59:05Z 2024-02-01T16:36:12Z

This NVIDIA HPC SDK update includes the cuBLASMp preview library, along with minor bug fixes and enhancements.

]]> 0 Tanya Lenz <![CDATA[Webinar: Analysis of OpenACC Validation and Verification Testsuite]]> http://www.open-lab.net/blog/?p=74475 2023-12-14T19:29:46Z 2023-12-01T21:00:00Z

On December 7, learn how to verify OpenACC implementations across compilers and system architectures with the validation testsuite.

]]> 0 Graham Lopez <![CDATA[Simplifying GPU Programming for HPC with NVIDIA Grace Hopper Superchip]]> http://www.open-lab.net/blog/?p=72720 2023-11-16T19:16:39Z 2023-11-13T17:13:02Z

The new hardware developments in NVIDIA Grace Hopper Superchip systems enable some dramatic changes to the way developers approach GPU programming. Most notably, the bidirectional, high-bandwidth, and cache-coherent connection between CPU and GPU memory means that the user can develop their application for both processors while using a single, unified address space.

]]> 1 Stefan Maintz <![CDATA[Optimize Energy Efficiency of Multi-Node VASP Simulations with NVIDIA Magnum IO]]> http://www.open-lab.net/blog/?p=72724 2023-11-20T18:42:51Z 2023-11-13T16:00:00Z

Computational energy efficiency has become a primary decision criterion for most supercomputing centers. Data centers, once built, are capped in terms of the amount of power they can use without expensive and time-consuming retrofits. Maximizing insight in the form of workload throughput then means maximizing workload per watt. NVIDIA products have, for several generations…

]]> 0 Tanya Lenz <![CDATA[Just Released: NVIDIA HPC SDK 23.9]]> http://www.open-lab.net/blog/?p=71163 2023-11-02T18:14:44Z 2023-10-05T20:00:00Z

This NVIDIA HPC SDK 23.9 update expands platform support and provides minor updates.

]]> 0 Jay Gould <![CDATA[Just Released: NVIDIA HPC SDK v23.7]]> http://www.open-lab.net/blog/?p=68650 2024-08-28T17:38:15Z 2023-07-31T19:00:00Z

NVIDIA HPC SDK version 23.7 is now available and provides minor updates and enhancements.

]]> 0 Michelle Horton <![CDATA[Model Parallelism and Conversational AI Workshops]]> http://www.open-lab.net/blog/?p=66993 2023-07-13T19:00:31Z 2023-06-22T19:57:14Z

Join these upcoming workshops to learn how to train large neural networks, or build a conversational AI pipeline.

]]> 0 Jay Gould <![CDATA[Just Released: NVIDIA HPC SDK v23.5]]> http://www.open-lab.net/blog/?p=65459 2023-06-09T20:20:37Z 2023-05-25T19:00:00Z

This update expands platform support and provides minor updates.

]]> 0 Jay Gould <![CDATA[Just Released: NVIDIA HPC SDK v23.3]]> http://www.open-lab.net/blog/?p=62843 2023-06-09T22:32:35Z 2023-04-03T17:15:35Z

Version 23.3 expands platform support and provides minor updates to the NVIDIA HPC SDK.

]]> 0 Jay Gould <![CDATA[Just Released: NVIDIA HPC SDK v23.1]]> http://www.open-lab.net/blog/?p=59890 2023-06-12T08:00:43Z 2023-01-25T20:00:00Z

Version 23.1 of the NVIDIA HPC SDK introduces CUDA 12 support, fixes, and minor enhancements.

]]> 0 Jay Gould <![CDATA[New Asynchronous Programming Model Library Now Available with NVIDIA HPC SDK v22.11]]> http://www.open-lab.net/blog/?p=57499 2023-05-24T00:18:31Z 2022-11-17T15:00:00Z

Celebrating the SuperComputing 2022 international conference, NVIDIA announces the release of HPC Software Development Kit (SDK) v22.11. Members of the NVIDIA Developer Program can download the release now for free. The NVIDIA HPC SDK is a comprehensive suite of compilers, libraries, and tools for high performance computing (HPC) developers. It provides everything developers need to…

]]> 0 Stefan Maintz <![CDATA[Scaling VASP with NVIDIA Magnum IO]]> http://www.open-lab.net/blog/?p=57394 2023-04-17T02:20:16Z 2022-11-15T21:42:07Z

You could make an argument that the history of civilization and technological advancement is the history of the search and discovery of materials. Ages are named not for leaders or civilizations but for the materials that defined them: Stone Age, Bronze Age, and so on. The current digital or information age could be renamed the Silicon or Semiconductor Age and retain the same meaning.

]]> 1 Jay Gould <![CDATA[Just Released: HPC SDK v22.9]]> http://www.open-lab.net/blog/?p=54598 2023-06-12T08:56:53Z 2022-10-12T19:00:00Z

This version 22.9 update to the NVIDIA HPC SDK includes fixes and minor enhancements.

]]> 0 Jay Gould <![CDATA[Top HPC Sessions at GTC 2022]]> http://www.open-lab.net/blog/?p=54608 2023-06-12T08:56:40Z 2022-09-15T18:59:00Z

Learn about new CUDA features, digital twins for weather and climate, quantum circuit simulations, and much more with these GTC 2022 sessions.

]]> 0 Jay Gould <![CDATA[Just Released: New Arm CPU Support and Advancements in HPC SDK 22.7]]> http://www.open-lab.net/blog/?p=50923 2022-09-09T16:10:22Z 2022-07-27T20:00:00Z

This release includes enhancements, fixes, and new support for Arm SVE, Rocky Linux OS, and Amazon EC2 C7g instances, powered by the latest generation AWS Graviton3 processors.

]]> 0 Alex McCaskey <![CDATA[Introducing NVIDIA CUDA-Q: The Platform for Hybrid Quantum-Classical Computing]]> http://www.open-lab.net/blog/?p=50278 2024-05-07T19:29:57Z 2022-07-14T17:00:00Z

The past decade has seen quantum computing leap out of academic labs into the mainstream. Efforts to build better quantum computers proliferate at both startups and large companies. And while it is still unclear how far we are away from using quantum advantage on common problems, it is clear that now is the time to build the tools needed to deliver valuable quantum applications. To start…

]]> 5 Elena Agostini <![CDATA[Boosting Inline Packet Processing Using DPDK and GPUdev with GPUs]]> http://www.open-lab.net/blog/?p=46566 2023-10-23T17:21:50Z 2022-04-29T03:39:55Z

The inline processing of network packets using GPUs is a packet-analysis technique useful to a number of different application domains: signal processing, network security, information gathering, input reconstruction, and so on. The main requirement of these application types is to move received packets into GPU memory as soon as possible, to trigger the CUDA kernel responsible to execute…

]]> 17 Jonas Latt <![CDATA[Multi-GPU Programming with Standard Parallel C++, Part 2]]> http://www.open-lab.net/blog/?p=44906 2023-12-05T21:52:40Z 2022-04-18T23:20:23Z

It may seem natural to expect that the performance of your CPU-to-GPU port will range below that of a dedicated HPC code. After all, you are limited by the constraints of the software architecture, the established API, and the need to account for sophisticated extra features expected by the user base. Not only that, the simplistic programming model of C++ standard parallelism allows for less…

]]> 0 Jonas Latt <![CDATA[Multi-GPU Programming with Standard Parallel C++, Part 1]]> http://www.open-lab.net/blog/?p=44904 2023-12-05T21:52:55Z 2022-04-18T23:18:13Z

The difficulty of porting an application to GPUs varies from one case to another. In the best-case scenario, you can accelerate critical code sections by calling into an existing GPU-optimized library. This is, for example, when the building blocks of your simulation software consist of BLAS linear algebra functions, which can be accelerated using cuBLAS. This is the second post in the…

]]> 0 Jeff Larkin http://jefflarkin.com <![CDATA[Developing Accelerated Code with Standard Language Parallelism]]> http://www.open-lab.net/blog/?p=43006 2025-05-29T16:29:28Z 2022-01-12T17:14:46Z

The NVIDIA platform is the most mature and complete platform for accelerated computing. In this post, I address the simplest, most productive, and most portable approach to accelerated computing. This is the first post in the Standard Parallel Programming series, which aims to instruct developers on the advantages of using parallelism in standard languages for accelerated computing…

]]> 0 Ram Cherukuri <![CDATA[Exploring the New Features of CUDA 11.3]]> http://www.open-lab.net/blog/?p=30563 2024-08-28T17:49:06Z 2021-04-16T00:40:00Z

CUDA is the software development platform for building GPU-accelerated applications, providing all the components you need to develop applications that use NVIDIA GPUs. CUDA is ideal for diverse workloads from high performance computing, data science analytics, and AI applications. The latest release, CUDA 11.3, and its features are focused on enhancing the programming model and performance of…

]]> 2 Ram Cherukuri <![CDATA[Enhancing Memory Allocation with New NVIDIA CUDA 11.2 Features]]> http://www.open-lab.net/blog/?p=22770 2024-08-28T17:54:37Z 2020-12-16T16:00:00Z

CUDA is the software development platform for building GPU-accelerated applications, providing all the components needed to develop applications targeting every NVIDIA GPU platform for general purpose compute acceleration. The latest CUDA release, CUDA 11.2, is focused on improving the user experience and application performance for CUDA developers. CUDA 11.2…

]]> 0 Brent Leback <![CDATA[Bringing Tensor Cores to Standard Fortran]]> http://www.open-lab.net/blog/?p=19380 2023-06-12T21:14:42Z 2020-08-07T19:35:38Z

Tuned math libraries are an easy and dependable way to extract the ultimate performance from your HPC system. However, for long-lived applications or those that need to run on a variety of platforms, adapting library calls for each vendor or library version can be a maintenance nightmare. A compiler that can automatically generate calls to tuned math libraries gives you the best of both…

]]> 1 Pradeep Gupta <![CDATA[CUDA Refresher: The CUDA Programming Model]]> http://www.open-lab.net/blog/?p=18697 2023-06-12T21:15:43Z 2020-06-26T17:48:12Z

This is the fourth post in the CUDA Refresher series, which has the goal of refreshing key concepts in CUDA, tools, and optimization for beginning or...

]]> 2 Pradeep Gupta <![CDATA[CUDA Refresher: The GPU Computing Ecosystem]]> http://www.open-lab.net/blog/?p=18011 2022-08-21T23:40:10Z 2020-05-21T23:20:00Z

This is the third post in the CUDA Refresher series, which has the goal of refreshing key concepts in CUDA, tools, and optimization for beginning or intermediate developers. Ease of programming and a giant leap in performance are among the key reasons for the CUDA platform's widespread adoption. The second biggest reason for the success of the CUDA platform is the availability of a broad and…

]]> 0 Pradeep Gupta <![CDATA[CUDA Refresher: Getting started with CUDA]]> http://www.open-lab.net/blog/?p=17309 2022-08-21T23:40:01Z 2020-05-06T18:08:15Z

This is the second post in the CUDA Refresher series, which has the goal of refreshing key concepts in CUDA, tools, and optimization for beginning or intermediate developers. Advancements in science and business drive an insatiable demand for more computing resources and acceleration of workloads. Parallel programming is a profound way for developers to accelerate their applications. However…

]]> 0 Mark Harris <![CDATA[Cooperative Groups: Flexible CUDA Thread Programming]]> http://www.open-lab.net/blog/parallelforall/?p=8415 2023-06-12T21:16:47Z 2017-10-05T04:17:43Z

In efficient parallel algorithms, threads cooperate and share data to perform collective computations. To share data, the threads must synchronize. The granularity of sharing varies from algorithm to algorithm, so thread synchronization should be flexible. Making synchronization an explicit part of the program ensures safety, maintainability, and modularity. CUDA 9 introduces Cooperative Groups…

]]> 32 Patric Zhao <![CDATA[Accelerate R Applications with CUDA]]> http://www.open-lab.net/blog/parallelforall/?p=3369 2022-08-21T23:37:06Z 2014-08-05T02:13:24Z

R is a free software environment for statistical computing and graphics that provides a programming language and built-in libraries of mathematics operations for statistics, data analysis, machine learning and much more. Many domain experts and researchers use the R platform and contribute R software, resulting in a large ecosystem of free software packages available through CRAN (the…

]]> 19 Adam McLaughlin http://users.ece.gatech.edu/~amclaughlin7/index.html <![CDATA[Accelerating Graph Betweenness Centrality with CUDA]]> http://www.open-lab.net/blog/parallelforall/?p=3380 2022-08-21T23:37:06Z 2014-07-24T02:13:53Z

Graph analysis is a fundamental tool for domains as diverse as social networks, computational biology, and machine learning. Real-world applications of graph algorithms involve tremendously large networks that cannot be inspected manually. Betweenness Centrality (BC) is a popular analytic that determines vertex influence in a graph. It has many practical use cases, including finding the best…

]]> 3 Dustin Franklin https://www.linkedin.com/in/dustin-franklin-b3aaa173 <![CDATA[Low-Power Sensing and Autonomy With NVIDIA Jetson TK1]]> http://www.open-lab.net/blog/parallelforall/?p=3339 2022-08-21T23:37:06Z 2014-06-25T18:04:02Z

NVIDIA's Tegra K1 (TK1) is the first Arm system-on-chip (SoC) with integrated CUDA. With 192 Kepler GPU cores and four Arm Cortex-A15 cores delivering a total of 327 GFLOPS of compute performance, TK1 has the capacity to process lots of data with CUDA while typically drawing less than 6W of power (including the SoC and DRAM). This brings game-changing performance to low-SWaP (Size…

]]> 6 Nikolay Markovskiy <![CDATA[Drop-in Acceleration of GNU Octave]]> http://www.open-lab.net/blog/parallelforall/?p=3188 2022-08-21T23:37:04Z 2014-06-05T16:20:10Z

cuBLAS is an implementation of the BLAS library that leverages the teraflops of performance provided by NVIDIA GPUs. However, cuBLAS cannot be used as a direct BLAS replacement for applications originally intended to run on the CPU. In order to use the cuBLAS API: Such an API permits the fine tuning required to minimize redundant data copies to and from the GPU in arbitrarily complicated…

]]> 10 Justin Luitjens <![CDATA[Faster Parallel Reductions on Kepler]]> http://www.open-lab.net/blog/parallelforall/?p=2551 2022-08-21T23:37:02Z 2014-02-14T04:30:44Z

Parallel reduction is a common building block for many parallel algorithms. A presentation from 2007 by Mark Harris provided a detailed strategy for implementing parallel reductions on GPUs, but this 6-year-old document bears updating. In this post I will show you some features of the Kepler GPU architecture which make reductions even faster: the shuffle (SHFL) instruction and fast device memory…

]]> 53 Tero Karras https://research.nvidia.com/users/tero-karras <![CDATA[Thinking Parallel, Part III: Tree Construction on the GPU]]> http://www.parallelforall.com/?p=635 2022-08-21T23:36:49Z 2012-12-20T05:15:09Z

In part II of this series, we looked at hierarchical tree traversal as a means of quickly identifying pairs of potentially colliding 3D objects and we demonstrated how optimizing for low divergence can result in substantial performance gains on massively parallel processors. Having a fast traversal algorithm is not very useful, though, unless we also have a tree to go with it. In this part…

]]> 28 Tero Karras https://research.nvidia.com/users/tero-karras <![CDATA[Thinking Parallel, Part II: Tree Traversal on the GPU]]> http://www.parallelforall.com/?p=632 2022-08-21T23:36:49Z 2012-11-27T04:06:33Z

In the first part of this series, we looked at collision detection on the GPU and discussed two commonly used algorithms that find potentially colliding pairs in a set of 3D objects using their axis-aligned bounding boxes (AABBs). Each of the two algorithms has its weaknesses: sort and sweep suffers from high execution divergence, while uniform grid relies on too many simplifying assumptions that…

]]> 6 Tero Karras https://research.nvidia.com/users/tero-karras <![CDATA[Thinking Parallel, Part I: Collision Detection on the GPU]]> http://test.markmark.net/?p=333 2022-08-21T23:36:47Z 2012-11-12T14:53:37Z

This series of posts aims to highlight some of the main differences between conventional programming and parallel programming on the algorithmic level, using broad-phase collision detection as an example. The first part will give some background, discuss two commonly used approaches, and introduce the concept of divergence. The second part will switch gears to hierarchical tree traversal in order…

]]> 1