Standard languages have begun adding features that compilers can use for accelerated GPU and CPU parallel programming, for instance, loops and array math intrinsics in Fortran. This is the fourth post in the Standard Parallel Programming series, which aims to instruct developers on the advantages of using parallelism in standard languages for accelerated computing: Using standard…
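As a minimal sketch of the standard-language style these posts describe, the SAXPY loop below uses only Fortran's do concurrent construct, with no directives. The nvfortran -stdpar=gpu build line reflects the NVIDIA HPC SDK's standard-parallelism support; the program itself is invented for illustration.

    ! saxpy_stdpar.f90 -- standard Fortran only, no directives.
    ! Illustrative build line: nvfortran -stdpar=gpu saxpy_stdpar.f90
    program saxpy_stdpar
      implicit none
      integer, parameter :: n = 1000000
      real :: x(n), y(n), a
      integer :: i

      a = 2.0
      x = 1.0
      y = 0.0

      ! do concurrent asserts that iterations are independent, which is
      ! what lets the compiler parallelize the loop for a GPU or CPU.
      do concurrent (i = 1:n)
        y(i) = y(i) + a * x(i)
      end do

      print *, 'y(1) =', y(1)
    end program saxpy_stdpar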
Parallel Compiler Assisted Software Testing (PCAST) is a feature available in the NVIDIA HPC Fortran, C++, and C compilers. PCAST has two use cases. The first is testing changes to parts of a program, new compile-time flags, or a port to a new compiler or to a new processor. You might want to test whether a new library gives the same result, or test the safety of adding OpenMP parallelism…
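A hedged sketch of PCAST's autocompare mode for the OpenACC case: compiled with the -gpu=autocompare flag from the NVIDIA HPC compiler documentation, the runtime executes the compute region on both CPU and GPU and reports divergences when data returns to the host. The subroutine itself is an invented example.

    ! smooth_pcast.f90
    ! Illustrative build: nvfortran -acc -gpu=autocompare smooth_pcast.f90
    ! GPU results are checked against a CPU run of the same loop, and
    ! any differences are reported at runtime.
    subroutine smooth(a, b, n)
      implicit none
      integer, intent(in) :: n
      real, intent(in)    :: a(n)
      real, intent(inout) :: b(n)
      integer :: i
      !$acc parallel loop copyin(a) copy(b)
      do i = 2, n - 1
        b(i) = 0.25 * (a(i-1) + 2.0*a(i) + a(i+1))
      end do
    end subroutine smooth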
Fortran developers have long been able to accelerate their programs using CUDA Fortran or OpenACC. For more up-to-date information, please read Using Fortran Standard Parallel Programming for GPU Acceleration, which aims to instruct developers on the advantages of using parallelism in standard languages for accelerated computing. Now with the latest 20.11 release of the NVIDIA HPC SDK…
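To give a flavor of the feature that release introduced, here is a sketch of a multi-index do concurrent stencil; the -stdpar=gpu and -stdpar=multicore flag spellings follow the NVIDIA HPC SDK documentation, and the kernel is illustrative rather than taken from the post.

    ! stencil_stdpar.f90
    ! Illustrative builds of the same source:
    !   nvfortran -stdpar=gpu       stencil_stdpar.f90   (offload to GPU)
    !   nvfortran -stdpar=multicore stencil_stdpar.f90   (parallel on CPU)
    subroutine average4(aold, anew, n, m)
      implicit none
      integer, intent(in) :: n, m
      real, intent(in)    :: aold(n, m)
      real, intent(inout) :: anew(n, m)
      integer :: i, j
      ! One do concurrent may carry several indices; every (i, j)
      ! iteration is declared independent of all the others.
      do concurrent (i = 2:n-1, j = 2:m-1)
        anew(i, j) = 0.25 * (aold(i-1, j) + aold(i+1, j) &
                           + aold(i, j-1) + aold(i, j+1))
      end do
    end subroutine average4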
Tuned math libraries are an easy and dependable way to extract the ultimate performance from your HPC system. However, for long-lived applications or those that need to run on a variety of platforms, adapting library calls for each vendor or library version can be a maintenance nightmare. A compiler that can automatically generate calls to tuned math libraries gives you the best of both…
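In nvfortran, one concrete form of this is mapping the standard matmul intrinsic onto cuTENSOR. The sketch below assumes the cutensorex overload module and the -cudalib=cutensor link flag described in the NVIDIA HPC SDK documentation; sizes and flags are illustrative.

    ! matmul_cutensor.f90
    ! Illustrative build:
    !   nvfortran -cuda -gpu=managed -cudalib=cutensor matmul_cutensor.f90
    program matmul_cutensor
      use cutensorex   ! assumed module: overloads matmul with a cuTENSOR path
      implicit none
      integer, parameter :: n = 1024
      real(4), managed :: a(n, n), b(n, n), c(n, n)

      call random_number(a)
      call random_number(b)

      ! The source stays standard Fortran; the overloaded intrinsic
      ! dispatches to the vendor-tuned library instead of inline code.
      c = matmul(a, b)

      print *, 'c(1,1) =', c(1, 1)
    end program matmul_cutensor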
The CUDA Fortran compiler from PGI now supports programming Tensor Cores with NVIDIA's Volta V100 and Turing GPUs. This enables scientific programmers using Fortran to take advantage of FP16 matrix operations accelerated by Tensor Cores. Let's take a look at how Fortran supports Tensor Cores. Tensor Cores offer substantial performance gains over typical CUDA GPU core programming on Tesla V100…
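A condensed sketch of a single-warp Tensor Core multiply in CUDA Fortran, following the WMMA interface as documented for the PGI/NVIDIA compilers; treat the exact macro and routine spellings (cuf_macros.CUF, WMMASubMatrix, wmmaLoadMatrix, wmmaMatMul, wmmaStoreMatrix) as assumptions to check against the docs.

    ! tc_wmma.CUF -- one warp computes C = A*B for 16x16x16 tiles.
    ! The WMMASubMatrix declarations are preprocessor macros, hence the
    ! include and the .CUF suffix. Illustrative build: nvfortran -cuda tc_wmma.CUF
    #include "cuf_macros.CUF"
    module tc
    contains
      attributes(global) subroutine wmma_single(a, b, c)
        use wmma
        implicit none
        real(2), intent(in) :: a(16, *), b(16, *)  ! FP16 input tiles
        real(4) :: c(16, *)                        ! FP32 accumulator tile
        WMMASubMatrix(WMMAMatrixA, 16, 16, 16, Real, WMMAColMajor) :: sa
        WMMASubMatrix(WMMAMatrixB, 16, 16, 16, Real, WMMAColMajor) :: sb
        WMMASubMatrix(WMMAMatrixC, 16, 16, 16, Real, WMMAKind4)    :: sc

        sc = 0.0
        call wmmaLoadMatrix(sa, a(1, 1), 16)   ! warp loads the A tile
        call wmmaLoadMatrix(sb, b(1, 1), 16)   ! warp loads the B tile
        call wmmaMatMul(sc, sa, sb, sc)        ! sc = sa*sb + sc on Tensor Cores
        call wmmaStoreMatrix(c(1, 1), sc, 16)  ! write the FP32 result
      end subroutine wmma_single
    end module tc

Launched with a single warp, for example wmma_single<<<1, 32>>>(a_d, b_d, c_d), the 32 threads cooperate on one matrix multiply-accumulate.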
Solar storms consist of massive explosions on the Sun that can release the energy of over 2 billion megatons of TNT in the form of solar flares and Coronal Mass Ejections (CMEs). CMEs eject billions of tons of magnetized plasma into space, and while most of them miss Earth entirely, there have been some in the past that would have inflicted great damage on our modern technological society had they…
Russian scientists from Lomonosov Moscow State University used an ordinary GPU-accelerated desktop computer to solve, in just 15 minutes, complex quantum mechanics equations that would typically take two to three days on a large CPU-only supercomputer. Senior researchers Vladimir Pomerantcev and Olga Rubtsova and professor Vladimir Kukulin used a GeForce GTX 670 with CUDA and the PGI CUDA Fortran…
The new PGI compiler release includes support for C++ and Fortran applications to run in parallel on multicore CPUs or GPU accelerators. OpenACC gives scientists and researchers a simple and powerful way to accelerate scientific computing applications incrementally. With the PGI Compiler 15.10 release, OpenACC enables performance portability between accelerators and multicore CPUs.
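A sketch of what that portability looks like in practice: one directive-annotated source built for two targets. The -ta= target flags match the PGI 15.10-era command line; the loop itself is illustrative.

    ! daxpy_acc.f90 -- one source, two targets (PGI 15.10-era flags):
    !   pgfortran -acc -ta=multicore daxpy_acc.f90   (parallel across CPU cores)
    !   pgfortran -acc -ta=tesla     daxpy_acc.f90   (offload to an NVIDIA GPU)
    subroutine daxpy(n, a, x, y)
      implicit none
      integer, intent(in) :: n
      real(8), intent(in) :: a, x(n)
      real(8), intent(inout) :: y(n)
      integer :: i
      !$acc parallel loop
      do i = 1, n
        y(i) = y(i) + a * x(i)
      end do
    end subroutine daxpy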
Programmability is crucial to accelerated computing, and NVIDIA's CUDA Toolkit has been critical to the success of GPU computing. Over three million copies of the CUDA Toolkit have been downloaded since its first launch. However, many scientists and researchers have yet to benefit from GPU computing. These scientists have limited time to learn and apply a parallel programming language, and they often have…
You may want to read the more recent post Getting Started with OpenACC by Jeff Larkin. In my previous post I added 3 lines of OpenACC directives to a Jacobi iteration code, achieving more than 2x speedup by running it on a GPU. In this post I'll continue where I left off and demonstrate how we can use OpenACC directive clauses to take more explicit control over how the compiler parallelizes our…
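As a taste of that explicit control, here is a hedged sketch of a Jacobi-style sweep with the mapping clauses spelled out; the clause values are illustrative tuning knobs, not the post's recommendations.

    subroutine jacobi_sweep(a, anew, n, m, err)
      implicit none
      integer, intent(in) :: n, m
      real, intent(in)    :: a(n, m)
      real, intent(inout) :: anew(n, m)
      real, intent(out)   :: err
      integer :: i, j
      err = 0.0
      ! collapse(2) merges both loops into one iteration space; gang,
      ! vector, and vector_length say how it maps onto the GPU.
      !$acc parallel loop collapse(2) gang vector vector_length(128) &
      !$acc reduction(max:err)
      do j = 2, m - 1
        do i = 2, n - 1
          anew(i, j) = 0.25 * (a(i-1, j) + a(i+1, j) &
                             + a(i, j-1) + a(i, j+1))
          err = max(err, abs(anew(i, j) - a(i, j)))
        end do
      end do
    end subroutine jacobi_sweep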
You may want to read the more recent post Getting Started with OpenACC by Jeff Larkin. In this post I'll continue where I left off in my introductory post about OpenACC and provide a somewhat more realistic example. This simple C/Fortran code example demonstrates a 2x speedup with the addition of just a few lines of OpenACC directives, and in the next post I'll add just a few more lines to push…
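The flavor of that few-line addition, sketched in Fortran rather than quoted from the post: a kernels region wrapped around an untouched loop nest, leaving the parallelization strategy to the compiler.

    subroutine jacobi_simple(a, anew, n, m)
      implicit none
      integer, intent(in) :: n, m
      real, intent(in)    :: a(n, m)
      real, intent(inout) :: anew(n, m)
      integer :: i, j
      ! The directive pair is the entire change: the compiler analyzes
      ! the region, generates GPU kernels, and manages data movement.
      !$acc kernels
      do j = 2, m - 1
        do i = 2, n - 1
          anew(i, j) = 0.25 * (a(i-1, j) + a(i+1, j) &
                             + a(i, j-1) + a(i, j+1))
        end do
      end do
      !$acc end kernels
    end subroutine jacobi_simple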