The HPC SDK v25.3 release includes support for NVIDIA Blackwell GPUs and an optimized allocator for Arm CPUs.
]]>The previous post How to Accelerate Quantitative Finance with ISO C++ Standard Parallelism demonstrated how to write a Black-Scholes simulation using ISO C++ standard parallelism with the code found in the /NVIDIA/accelerated-quant-finance GitHub repo. This approach enables you to productively write code that is both concise and portable. Using solely standard C++, it��s possible to write an��
]]>Quantitative finance libraries are software packages that consist of mathematical, statistical, and, more recently, machine learning models designed for use in quantitative investment contexts. They contain a wide range of functionalities, often proprietary, to support the valuation, risk management, construction, and optimization of investment portfolios. Financial firms that develop such��
]]>This NVIDIA HPC SDK update includes the cuBLASMp preview library, along with minor bug fixes and enhancements.
]]>On December 7, learn how to verify OpenACC implementations across compilers and system architectures with the validation testsuite.
]]>The new hardware developments in NVIDIA Grace Hopper Superchip systems enable some dramatic changes to the way developers approach GPU programming. Most notably, the bidirectional, high-bandwidth, and cache-coherent connection between CPU and GPU memory means that the user can develop their application for both processors while using a single, unified address space.
]]>Computational energy efficiency has become a primary decision criterion for most supercomputing centers. Data centers, once built, are capped in terms of the amount of power they can use without expensive and time-consuming retrofits. Maximizing insight in the form of workload throughput then means maximizing workload per watt. NVIDIA products have, for several generations��
]]>This NVIDIA HPC SDK 23.9 update expands platform support and provides minor updates.
]]>NVIDIA HPC SDK version 23.7 is now available and provides minor updates and enhancements.
]]>Join these upcoming workshops to learn how to train large neural networks, or build a conversational AI pipeline.
]]>This update expands platform support and provides minor updates.
]]>Version 23.3 expands platform support and provides minor updates to the NVIDIA HPC SDK.
]]>Version 23.1 of the NVIDIA HPC SDK introduces CUDA 12 support, fixes, and minor enhancements.
]]>Celebrating the SuperComputing 2022 international conference, NVIDIA announces the release of HPC Software Development Kit (SDK) v22.11. Members of the NVIDIA Developer Program can download the release now for free. The NVIDIA HPC SDK is a comprehensive suite of compilers, libraries, and tools for high performance computing (HPC) developers. It provides everything developers need to��
]]>You could make an argument that the history of civilization and technological advancement is the history of the search and discovery of materials. Ages are named not for leaders or civilizations but for the materials that defined them: Stone Age, Bronze Age, and so on. The current digital or information age could be renamed the Silicon or Semiconductor Age and retain the same meaning.
]]>This version 22.9 update to the NVIDIA HPC SDK includes fixes and minor enhancements.
]]>Learn about new CUDA features, digital twins for weather and climate, quantum circuit simulations, and much more with these GTC 2022 sessions.
]]>This release includes enhancements, fixes, and new support for Arm SVE, Rocky Linux OS, and Amazon EC2 C7g instances, powered by the latest generation AWS Graviton3 processors.
]]>The past decade has seen quantum computing leap out of academic labs into the mainstream. Efforts to build better quantum computers proliferate at both startups and large companies. And while it is still unclear how far we are away from using quantum advantage on common problems, it is clear that now is the time to build the tools needed to deliver valuable quantum applications. To start��
]]>The inline processing of network packets using GPUs is a packet-analysis technique useful to a number of different application domains: signal processing, network security, information gathering, input reconstruction, and so on. The main requirement of these application types is to move received packets into GPU memory as soon as possible, to trigger the CUDA kernel responsible to execute��
]]>It may seem natural to expect that the performance of your CPU-to-GPU port will range below that of a dedicated HPC code. After all, you are limited by the constraints of the software architecture, the established API, and the need to account for sophisticated extra features expected by the user base. Not only that, the simplistic programming model of C++ standard parallelism allows for less��
]]>The difficulty of porting an application to GPUs varies from one case to another. In the best-case scenario, you can accelerate critical code sections by calling into an existing GPU-optimized library. This is, for example, when the building blocks of your simulation software consist of BLAS linear algebra functions, which can be accelerated using cuBLAS. This is the second post in the��
]]>The NVIDIA platform is the most mature and complete platform for accelerated computing. In this post, I address the simplest, most productive, and most portable approach to accelerated computing. This is the first post in the Standard Parallel Programming series, which aims to instruct developers on the advantages of using parallelism in standard languages for accelerated computing��
]]>CUDA is the software development platform for building GPU-accelerated applications, providing all the components you need to develop applications that use NVIDIA GPUs. CUDA is ideal for diverse workloads from high performance computing, data science analytics, and AI applications. The latest release, CUDA 11.3, and its features are focused on enhancing the programming model and performance of��
]]>CUDA is the software development platform for building GPU-accelerated applications, providing all the components needed to develop applications targeting every NVIDIA GPU platform for general purpose compute acceleration. The latest CUDA release, CUDA 11.2, is focused on improving the user experience and application performance for CUDA developers. CUDA 11.2��
]]>Tuned math libraries are an easy and dependable way to extract the ultimate performance from your HPC system. However, for long-lived applications or those that need to run on a variety of platforms, adapting library calls for each vendor or library version can be a maintenance nightmare. A compiler that can automatically generate calls to tuned math libraries gives you the best of both��
]]>This is the third post in the CUDA Refresher series, which has the goal of refreshing key concepts in CUDA, tools, and optimization for beginning or intermediate developers. Ease of programming and a giant leap in performance is one of the key reasons for the CUDA platform��s widespread adoption. The second biggest reason for the success of the CUDA platform is the availability of a broad and��
]]>This is the second post in the CUDA Refresher series, which has the goal of refreshing key concepts in CUDA, tools, and optimization for beginning or intermediate developers. Advancements in science and business drive an insatiable demand for more computing resources and acceleration of workloads. Parallel programming is a profound way for developers to accelerate their applications. However��
]]>In efficient parallel algorithms, threads cooperate and share data to perform collective computations. To share data, the threads must synchronize. The granularity of sharing varies from algorithm to algorithm, so thread synchronization should be flexible. Making synchronization an explicit part of the program ensures safety, maintainability, and modularity. CUDA 9 introduces Cooperative Groups��
]]>R is a free software environment for statistical computing and graphics that provides a programming language and built-in libraries of mathematics operations for statistics, data analysis, machine learning and much more. Many domain experts and researchers use the R platform and contribute R software, resulting in a large ecosystem of free software packages available through CRAN (the��
]]>Graph analysis is a fundamental tool for domains as diverse as social networks, computational biology, and machine learning. Real-world applications of graph algorithms involve tremendously large networks that cannot be inspected manually. Betweenness Centrality (BC) is a popular analytic that determines vertex influence in a graph. It has many practical use cases, including finding the best��
]]>NVIDIA��s Tegra K1 (TK1) is the first Arm system-on-chip (SoC) with integrated CUDA. With 192 Kepler GPU cores and four Arm Cortex-A15 cores delivering a total of 327 GFLOPS of compute performance, TK1 has the capacity to process lots of data with CUDA while typically drawing less than 6W of power (including the SoC and DRAM). This brings game-changing performance to low-SWaP (Size��
]]>cuBLAS is an implementation of the BLAS library that leverages the teraflops of performance provided by NVIDIA GPUs. However, cuBLAS can not be used as a direct BLAS replacement for applications originally intended to run on the CPU. In order to use the cuBLAS API: Such an API permits the fine tuning required to minimize redundant data copies to and from the GPU in arbitrarily complicated��
]]>Parallel reduction is a common building block for many parallel algorithms. A presentation from 2007 by Mark Harris provided a detailed strategy for implementing parallel reductions on GPUs, but this 6-year old document bears updating. In this post I will show you some features of the Kepler GPU architecture which make reductions even faster: the shuffle (SHFL) instruction and fast device memory��
]]>In part II of this series, we looked at hierarchical tree traversal as a means of quickly identifying pairs of potentially colliding 3D objects and we demonstrated how optimizing for low divergence can result in substantial performance gains on massively parallel processors. Having a fast traversal algorithm is not very useful, though, unless we also have a tree to go with it. In this part��
]]>In the first part of this series, we looked at collision detection on the GPU and discussed two commonly used algorithms that find potentially colliding pairs in a set of 3D objects using their axis-aligned bounding boxes (AABBs). Each of the two algorithms has its weaknesses: sort and sweep suffers from high execution divergence, while uniform grid relies on too many simplifying assumptions that��
]]>This series of posts aims to highlight some of the main differences between conventional programming and parallel programming on the algorithmic level, using broad-phase collision detection as an example. The first part will give some background, discuss two commonly used approaches, and introduce the concept of divergence. The second part will switch gears to hierarchical tree traversal in order��
]]>