Pharmaceutical research demands fast, efficient simulations to predict how molecules interact, speeding up drug discovery. Jiqun Tu, a senior developer technology engineer at NVIDIA, and Ellery Russell, tech lead for the Desmond engine at Schrödinger, explore advanced GPU optimization techniques designed to accelerate molecular dynamics simulations. In this NVIDIA GTC 2024 session…
Quantitative finance libraries are software packages that consist of mathematical, statistical, and, more recently, machine learning models designed for use in quantitative investment contexts. They contain a wide range of functionalities, often proprietary, to support the valuation, risk management, construction, and optimization of investment portfolios. Financial firms that develop such…
The latest release of NVIDIA Omniverse delivers an exciting collection of new features based on Omniverse Kit 105, making it easier than ever for developers to get started building 3D simulation tools and workflows. Built on Universal Scene Description, known as OpenUSD, and NVIDIA RTX and AI technologies, Omniverse enables you to create advanced, real-time 3D simulation applications for…
Debugging is difficult. Debugging across multiple languages is especially challenging, and debugging across devices often requires a team with varying skill sets and expertise to reveal the underlying problem. Yet projects often require using multiple languages, to ensure high performance where necessary, a user-friendly experience, and compatibility where possible. Unfortunately…
Decades of computer science history have been devoted to devising solutions for efficient storage and retrieval of information. Hash maps (or hash tables) are a popular data structure for information storage given their amortized, constant-time guarantees for the insertion and retrieval of elements. However, despite their prevalence, hash maps are seldom discussed in the context of GPU…
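To make the data structure concrete, here is a minimal sketch (my illustration, not any particular library's implementation) of a GPU hash map insert: open addressing with linear probing, where threads claim empty slots with atomicCAS.

```cuda
// Toy GPU hash map: open addressing + linear probing. Empty slots hold
// kEmpty; a thread claims a slot atomically with atomicCAS. Simplified:
// no resizing, and the value store is not synchronized with the key claim.
constexpr unsigned int kEmpty = 0xFFFFFFFFu;

struct Slot { unsigned int key; unsigned int value; };

__device__ void insert(Slot* table, unsigned int capacity,
                       unsigned int key, unsigned int value)
{
    unsigned int i = key % capacity;                    // simple modulo hash
    while (true) {
        unsigned int prev = atomicCAS(&table[i].key, kEmpty, key);
        if (prev == kEmpty || prev == key) {            // slot claimed or key present
            table[i].value = value;
            return;
        }
        i = (i + 1) % capacity;                         // probe the next slot
    }
}
```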
Learn how to write scalable GPU-accelerated hybrid applications using C++ standard language features alongside MPI in this interactive hands-on self-paced course.
The computational needs for AI processing of sensor streams at the edge are increasingly demanding. Edge devices must keep up with high rates of incoming data streams, processing, displaying, archiving, and streaming results or closing a control loop in real time. This requires powerful, efficient, and accurate hardware and software solutions capable of high performance computing.
Learn how to write simple, portable, parallel-first GPU-accelerated applications using only C++ standard language features in this self-paced course from the NVIDIA Deep Learning Institute…
There are three main ways to accelerate GPU applications: compiler directives, programming languages, and preprogrammed libraries. Compiler directives such as OpenACC allow you to smoothly port your code to the GPU for acceleration with a directive-based programming model. While it is simple to use, it may not provide optimal performance in certain scenarios. Programming languages such as…
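As a taste of the directive-based approach, here is a minimal OpenACC sketch: a single pragma asks the compiler to offload the loop to the GPU, with data movement handled automatically.

```c
// One directive offloads the loop; the compiler generates the GPU kernel
// and manages data movement for x and y.
void saxpy(int n, float a, const float* x, float* y)
{
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```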
NVIDIA Nsight Compute is an interactive kernel profiler for CUDA applications. It provides detailed performance metrics and API debugging through a user interface and a command-line tool. Nsight Compute 2022.2 includes features to expand the supported environments and workflows for CUDA kernel profiling and optimization. Download now. The following outlines the feature highlights of…
Rob Smallshire once said, "You can write faster code in C++, but write code faster in Python." Since its release more than a decade ago, CUDA has given C and C++ programmers the ability to maximize the performance of their code on NVIDIA GPUs. More recently, libraries such as CuPy and PyTorch have allowed developers of interpreted languages to leverage the speed of the optimized CUDA libraries…
The Nsight suite of Developer Tools provides insightful tracing, debugging, profiling, and other analyses to optimize your complex computational applications across NVIDIA GPUs and CPUs, including x86, Arm, and Power architectures. NVIDIA Nsight Systems is a performance analysis tool designed to visualize, analyze, and optimize programming models, and to tune and scale efficiently across any…
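One common way to make Nsight Systems timelines more readable is NVTX annotation. A minimal sketch, where preprocess() and solve() are hypothetical stand-ins for real workload phases:

```cpp
// NVTX ranges show up as named regions on the Nsight Systems timeline.
// Build with -lnvToolsExt. preprocess() and solve() are placeholder workloads.
#include <nvToolsExt.h>

static void preprocess() { /* ... workload ... */ }
static void solve()      { /* ... workload ... */ }

int main() {
    nvtxRangePushA("preprocess");   // open a named range
    preprocess();
    nvtxRangePop();                 // close it

    nvtxRangePushA("solve");
    solve();
    nvtxRangePop();
    return 0;
}
```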
Today NVIDIA announced Nsight DL Designer, the first-in-class integrated development environment to support efficient design of deep neural networks for in-app inference. Download now! This tool streamlines the often iterative process of designing deep neural network models for in-app inferencing by providing efficient support at every stage of the process. Nsight DL Designer is a…
As PyData leverages much of the static-language world for speed, including CUDA, we need tools that profile and measure not only across languages but also across devices, CPU and GPU. There are many great profiling tools within the Python ecosystem: line profilers like cProfile, and profilers that can observe code execution in C extensions, like PySpy/Viztracer. But none of the Python profilers can…
When I joined the RAPIDS team in 2018, NVIDIA CUDA device memory allocation was a performance problem. RAPIDS cuDF allocates and deallocates memory at high frequency, because its APIs generally create new Series and DataFrames rather than modifying them in place. The overhead of cudaMalloc and the synchronization of cudaFree were holding RAPIDS back. My first task for RAPIDS was to help with this problem, so I created a rough…
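To see why pooling helps, here is a toy arena sub-allocator (my sketch, far simpler than RMM's actual pool allocator): one expensive cudaMalloc up front, after which allocations become cheap pointer bumps with no device synchronization.

```cpp
// Toy bump allocator over a single cudaMalloc'd slab. Individual allocations
// avoid the cudaMalloc/cudaFree overhead described above; everything is
// released at once with reset(). Illustration only.
#include <cuda_runtime.h>
#include <cstddef>

struct DeviceArena {
    char*  base     = nullptr;
    size_t offset   = 0;
    size_t capacity = 0;

    explicit DeviceArena(size_t bytes) : capacity(bytes) {
        cudaMalloc(&base, bytes);               // one expensive allocation
    }
    ~DeviceArena() { cudaFree(base); }

    void* allocate(size_t bytes) {              // cheap pointer bump, no sync
        size_t aligned = (bytes + 255) & ~size_t(255);
        if (offset + aligned > capacity) return nullptr;
        void* p = base + offset;
        offset += aligned;
        return p;
    }
    void reset() { offset = 0; }                // free everything at once
};
```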
Parallel Compiler Assisted Software Testing (PCAST) is a feature available in the NVIDIA HPC Fortran, C++, and C compilers. PCAST has two use cases. The first is testing changes to parts of a program, new compile-time flags, or a port to a new compiler or to a new processor. You might want to test whether a new library gives the same result, or test the safety of adding OpenMP parallelism…
Historically, accelerating your C++ code with GPUs has not been possible in Standard C++ without using language extensions or additional libraries. In many cases, the results of these ports are worth the effort. But what if you could get the same effect without that cost? What if you could take your Standard C++ code and accelerate it on a GPU? Now you can!
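A minimal sketch of what this looks like in practice, assuming the nvc++ compiler with -stdpar: a SAXPY written entirely in ISO C++17, where the parallel execution policy lets the compiler offload the algorithm to the GPU.

```cpp
// Pure Standard C++: no CUDA extensions. Compiled with nvc++ -stdpar,
// std::transform with a parallel policy can execute on the GPU.
#include <algorithm>
#include <execution>
#include <vector>

int main() {
    std::vector<float> x(1 << 20, 1.0f), y(1 << 20, 2.0f);
    const float a = 3.0f;
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(), y.begin(), y.begin(),
                   [a](float xi, float yi) { return a * xi + yi; });
    return 0;
}
```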
Update, May 9, 2018: TensorFlow v1.7 and above integrates with TensorRT 3.0.4. NVIDIA is working on supporting the integration for a wider set of configurations and versions. We'll publish updates when these become available. Meanwhile, if you're using , simply download the TensorRT files for Ubuntu 14.04, not 16.04, no matter what version of Ubuntu you're running.
Hybridizer is a compiler from Altimesh that lets you program GPUs and other accelerators from C# code or .NET Assembly. Using decorated symbols to express parallelism, Hybridizer generates source code or binaries optimized for multicore CPUs and GPUs. In this blog post we illustrate the CUDA target. Figure 1 shows the Hybridizer compilation pipeline. Using parallelization patterns such as…
Update May 21, 2018: CUTLASS 1.0 is now available as Open Source software at the CUTLASS repository. CUTLASS 1.0 has changed substantially from our preview release described in the blog post below. We have decomposed the structure of the GEMM computation into deeper, structured primitives for loading data, computing predicate masks, streaming data at each level of the GEMM hierarchy…
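For intuition about that hierarchy, here is a toy shared-memory-tiled SGEMM (my sketch, not CUTLASS code): tiles stream from global memory through shared memory, and each thread accumulates its output element in registers.

```cuda
// Toy tiled SGEMM: global -> shared -> registers, the same data pipeline the
// CUTLASS primitives formalize. C = A * B, row-major, no alpha/beta.
#define TILE 16

__global__ void sgemm_tiled(int M, int N, int K,
                            const float* A, const float* B, float* C)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < K; t += TILE) {
        // Stage one tile of A and one tile of B through shared memory.
        As[threadIdx.y][threadIdx.x] = (row < M && t + threadIdx.x < K)
            ? A[row * K + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (t + threadIdx.y < K && col < N)
            ? B[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)       // accumulate in registers
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N)
        C[row * N + col] = acc;
}
```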
A team of researchers from Seoul National University built a pedestrian-following service robot to drive smart shopping carts and other autonomous helpers. The performance of their initial CPU-only implementation was "not acceptable for a real-time system," so they now use a GPU-accelerated CUDA implementation for a 13x performance boost. Mobile robots that track a person have received…
Wired discusses Google's announcement that it is open-sourcing its TensorFlow machine learning system, noting the system uses GPUs to both train and run artificial intelligence services at the company. Inside Google, when tackling tasks like image recognition, speech recognition, and language translation, TensorFlow depends on machines equipped with GPUs that were originally designed to render…
New PGI compiler release includes support for C++ and Fortran applications to run in parallel on multi-core CPUs or GPU accelerators. OpenACC gives scientists and researchers a simple and powerful way to accelerate scientific computing applications incrementally. With the PGI Compiler 15.10 release, OpenACC enables performance portability between accelerators and multicore CPUs.
The last two releases of CUDA have added support for the powerful new features of C++. In the post The Power of C++11 in CUDA 7 I discussed the importance of C++11 for parallel programming on GPUs, and in the post New Features in CUDA 7.5 I introduced a new experimental feature in the NVCC CUDA C++ compiler: support for GPU Lambda expressions. Lambda expressions, introduced in C++11…
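A small sketch of the feature (compiled with nvcc's experimental --expt-extended-lambda flag on CUDA 7.5-era toolkits): a __device__ lambda defined in host code and passed to a generic kernel.

```cuda
// A generic kernel applies any callable f; the caller passes a __device__
// lambda that captures `scale` by value.
#include <cuda_runtime.h>
#include <cstdio>

template <typename F>
__global__ void apply(int n, float* data, F f)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = f(data[i]);
}

int main() {
    const int n = 256;
    float* d = nullptr;
    cudaMallocManaged(&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) d[i] = float(i);

    float scale = 2.0f;
    apply<<<1, n>>>(n, d, [=] __device__ (float x) { return scale * x; });
    cudaDeviceSynchronize();

    printf("d[10] = %f\n", d[10]);   // prints 20.0
    cudaFree(d);
    return 0;
}
```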
Cloud computing is all about making resources available on demand, and its availability, flexibility, and lower cost have helped it take commercial computing by storm. At the Microsoft Build 2015 conference in San Francisco, Microsoft revealed that its Azure cloud computing platform is averaging over 90 thousand new customers per month and contains more than 1.4 million SQL databases being used by…
Ever looked up in the sky and wondered where it all came from? Cosmologists are in the same boat, trying to understand how the Universe arrived at the structure we observe today. They use supercomputers to follow the fate of very small initial fluctuations in an otherwise uniform density. As time passes, gravity causes the small fluctuations to grow, eventually forming the complex structures that…
Programmability is crucial to accelerated computing, and NVIDIA's CUDA Toolkit has been critical to the success of GPU computing. Over three million copies of the CUDA Toolkit have been downloaded since its first launch. However, there are many scientists and researchers yet to benefit from GPU computing. These scientists have limited time to learn and apply a parallel programming language, and they often have…
Today I'm happy to announce that the CUDA Toolkit 7.5 Release Candidate is now available. The CUDA Toolkit 7.5 adds support for FP16 storage for up to 2x larger data sets and reduced memory bandwidth, cuSPARSE GEMVI routines, instruction-level profiling, and more. Read on for full details. CUDA 7.5 expands support for 16-bit floating point (FP16) data storage and arithmetic…
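A minimal sketch of the FP16 storage pattern described, using the half type from cuda_fp16.h: data stays in 16-bit form in memory, halving footprint and bandwidth, while the arithmetic itself is done in FP32.

```cuda
// Load FP16, compute in FP32, store FP16: half the memory traffic of an
// all-FP32 pipeline.
#include <cuda_fp16.h>

__global__ void scale_fp16(int n, float a, const half* x, half* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float xf = __half2float(x[i]);   // convert FP16 -> FP32 for the math
        y[i] = __float2half(a * xf);     // convert back for compact storage
    }
}
```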
Today, software companies use frameworks such as .NET to target multiple platforms, from desktops to mobile phones, with a single code base to reduce costs by leveraging existing libraries and to cope with changing trends. While developers can easily write scalable parallel code for multi-core CPUs on .NET with libraries such as the Task Parallel Library, they face a bigger challenge using GPUs to…
Today I'm excited to announce the official release of CUDA 7, the latest release of the popular CUDA Toolkit. Download the CUDA Toolkit version 7 now from CUDA Zone! CUDA 7 has a huge number of improvements and new features, including C++11 support, the new cuSOLVER library, and support for Runtime Compilation. In a previous post I told you about the features of CUDA 7, so I won't repeat myself…
The ArrayFire library is a high-performance software library with a focus on portability and productivity. It supports highly tuned, GPU-accelerated algorithms using an easy-to-use API. ArrayFire wraps GPU memory into a simple "array" object, enabling developers to process vectors, matrices, and volumes on the GPU using high-level routines, without having to get involved with device kernel code.
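A brief sketch of that style, based on ArrayFire's documented C++ API (treat the details as illustrative): whole-array operations with no device kernel code in sight.

```cpp
// High-level GPU computing with ArrayFire: allocation, GEMM, and a
// reduction, all through the array object.
#include <arrayfire.h>
#include <cstdio>

int main() {
    af::array A = af::randu(1024, 1024);  // random matrix on the device
    af::array B = af::randu(1024, 1024);
    af::array C = af::matmul(A, B);       // GPU matrix multiply in one call
    float s = af::sum<float>(C);          // reduce back to a host scalar
    printf("sum(C) = %f\n", s);
    return 0;
}
```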
Often cited as the main reason that naïve C/C++ code cannot match FORTRAN performance, pointer aliasing is an important topic to understand when considering optimizations for your C/C++ code. In this tip I will describe what pointer aliasing is and a simple way to alter your code so that it does not harm your application performance. Two pointers alias if the memory to which they point…
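The fix the tip builds toward looks like this minimal sketch: __restrict__ qualifiers promise the compiler the pointers never overlap, so loaded values can stay in registers instead of being re-read after every store.

```cuda
// Without __restrict__, the compiler must assume out might alias a or b and
// reload them after each store; with it, loads can be cached in registers.
__global__ void scale_add(int n, float* __restrict__ out,
                          const float* __restrict__ a,
                          const float* __restrict__ b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + 2.0f * b[i];
}
```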
CUDA programmers often need to decide on a block size to use for a kernel launch. For key kernels, it's important to understand the constraints of the kernel and the GPU it is running on to choose a block size that will result in good performance. One common heuristic used to choose a good block size is to aim for high occupancy, which is the ratio of the number of active warps per multiprocessor…
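One way to act on that heuristic is the CUDA occupancy API (available since CUDA 6.5); a minimal sketch:

```cuda
// Ask the runtime for a block size that maximizes occupancy for this kernel,
// then derive the grid size from the problem size.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(int n, float* x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, scale, 0, 0);

    int n = 1 << 20;
    int gridSize = (n + blockSize - 1) / blockSize;
    printf("block %d, grid %d (min grid for full occupancy: %d)\n",
           blockSize, gridSize, minGridSize);
    return 0;
}
```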
With CUDA 6, NVIDIA introduced one of the most dramatic programming model improvements in the history of the CUDA platform: Unified Memory. In a typical PC or cluster node today, the memories of the CPU and GPU are physically distinct and separated by the PCI-Express bus. Before CUDA 6, that is exactly how the programmer has to view things. Data that is shared between the CPU and GPU must be…
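A minimal sketch of the change: with Unified Memory, one cudaMallocManaged allocation is visible to both processors, and the explicit cudaMemcpy calls disappear.

```cuda
// One pointer, valid on CPU and GPU. The driver migrates data on demand;
// the only requirement is synchronizing before the CPU reads GPU results.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void increment(int n, int* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1024;
    int* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(int));    // shared CPU/GPU allocation

    for (int i = 0; i < n; ++i) data[i] = i;      // written by the CPU
    increment<<<(n + 255) / 256, 256>>>(n, data); // updated by the GPU
    cudaDeviceSynchronize();                      // required before CPU access

    printf("data[0] = %d\n", data[0]);            // prints 1
    cudaFree(data);
    return 0;
}
```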
In our last CUDA C/C++ post we discussed how to transfer data efficiently between the host and device. In this post, we discuss how to overlap data transfers with computation on the host, computation on the device, and in some cases other data transfers between the host and device. Achieving overlap between data transfers and other operations requires the use of CUDA streams, so first let's learn…
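The core pattern the post develops, in a condensed sketch: split the work into chunks, give each chunk its own stream, and use pinned host memory so copies in one stream can overlap kernels in another.

```cuda
// Chunked, multi-stream pipeline: H2D copy, kernel, D2H copy per chunk.
// Pinned host memory (cudaMallocHost) is what makes the copies truly async.
#include <cuda_runtime.h>

__global__ void scale(float* a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20, nStreams = 4, chunk = N / nStreams;
    float *h_a = nullptr, *d_a = nullptr;
    cudaMallocHost(&h_a, N * sizeof(float));   // pinned host buffer
    cudaMalloc(&d_a, N * sizeof(float));
    for (int i = 0; i < N; ++i) h_a[i] = 1.0f;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d_a + off, h_a + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<chunk / 256, 256, 0, streams[s]>>>(d_a + off);
        cudaMemcpyAsync(h_a + off, d_a + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();   // work issued to different streams overlaps

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h_a);
    cudaFree(d_a);
    return 0;
}
```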
In the first post of this series we looked at the basic elements of CUDA C/C++ by examining a CUDA C/C++ implementation of SAXPY. In this second post we discuss how to analyze the performance of this and other CUDA C/C++ codes. We will rely on these performance measurement techniques in future posts where performance optimization will be increasingly important. CUDA performance measurement is…
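The workhorse technique from that post is CUDA event timing; a minimal sketch around an arbitrary kernel:

```cuda
// cudaEvent timing: events recorded in the same stream as the kernel
// bracket it, and elapsed time is read after synchronizing on the stop event.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(int n, float* x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    scale<<<(n + 255) / 256, 256>>>(n, d);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);        // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```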
This post is the first in a series on CUDA C and C++, which is the C/C++ interface to the CUDA parallel computing platform. This series of posts assumes familiarity with programming in C. We will be running a parallel series of posts about CUDA Fortran targeted at Fortran programmers. These two series will cover the basic concepts of parallel computing on the CUDA platform. From here on, unless I…
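For the shape of that first example, here is a compact SAXPY sketch along the lines of the post's walkthrough, covering the full host/device flow: allocate, copy, launch, copy back.

```cuda
// SAXPY (y = a*x + y): one thread per element, with explicit host/device
// allocation and transfers.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float* x = (float*)malloc(bytes);
    float* y = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);

    cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", y[0]);   // 2*1 + 2 = 4
    cudaFree(d_x); cudaFree(d_y);
    free(x); free(y);
    return 0;
}
```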
You may want to read the more recent post Getting Started with OpenACC by Jeff Larkin. In my previous post I added 3 lines of OpenACC directives to a Jacobi iteration code, achieving more than 2x speedup by running it on a GPU. In this post I'll continue where I left off and demonstrate how we can use OpenACC directive clauses to take more explicit control over how the compiler parallelizes our…
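To make the idea concrete, here is a hedged sketch of a Jacobi iteration with the kinds of clauses such a post discusses: a data region keeps the grids resident on the GPU across iterations, while reduction and collapse clauses steer the parallelization.

```c
// Jacobi sweep with OpenACC clauses: the data region hoists transfers out of
// the while loop; reduction(max:error) and collapse(2) guide the compiler.
#include <math.h>

void jacobi(int n, int m, float tol, int iter_max,
            float A[n][m], float Anew[n][m])
{
    float error = tol + 1.0f;
    int iter = 0;

    #pragma acc data copy(A[0:n][0:m]) create(Anew[0:n][0:m])
    while (error > tol && iter < iter_max) {
        error = 0.0f;

        #pragma acc parallel loop reduction(max:error) collapse(2)
        for (int j = 1; j < n - 1; j++)
            for (int i = 1; i < m - 1; i++) {
                Anew[j][i] = 0.25f * (A[j][i+1] + A[j][i-1]
                                    + A[j-1][i] + A[j+1][i]);
                error = fmaxf(error, fabsf(Anew[j][i] - A[j][i]));
            }

        #pragma acc parallel loop collapse(2)
        for (int j = 1; j < n - 1; j++)
            for (int i = 1; i < m - 1; i++)
                A[j][i] = Anew[j][i];

        iter++;
    }
}
```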
You may want to read the more recent post Getting Started with OpenACC by Jeff Larkin. In this post I'll continue where I left off in my introductory post about OpenACC and provide a somewhat more realistic example. This simple C/Fortran code example demonstrates a 2x speedup with the addition of just a few lines of OpenACC directives, and in the next post I'll add just a few more lines to push…