Let's imagine a situation. You buy a brand-new, cutting-edge, Volta-powered DGX-2 server. You've done your math right, expecting a 2x performance increase in ResNet50 training over the DGX-1 you had before. You plug it into your rack cabinet and run the training. That's when an unpleasant surprise pops up. Even though your math is correct, the speedup you're getting is lower than expected. Why?
Machine learning harnesses computing power to solve a variety of "hard" problems that seemed impossible to program using traditional languages and techniques. Machine learning avoids the need for a programmer to explicitly program the steps in solving a complex pattern-matching problem such as understanding speech or recognizing objects within an image. NVIDIA aims to bring machine learning to…
The CUDA Fortran compiler from PGI now supports programming Tensor Cores with NVIDIA's Volta V100 and Turing GPUs. This enables scientific programmers using Fortran to take advantage of FP16 matrix operations accelerated by Tensor Cores. Let's take a look at how Fortran supports Tensor Cores. Tensor Cores offer substantial performance gains over typical CUDA GPU core programming on Tesla V100…
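That post covers the Fortran interface, but the underlying operations map directly onto the CUDA C++ WMMA API, so a minimal CUDA C++ sketch may help illustrate what Tensor Core programming looks like. This is an illustrative example, not code from the post; it assumes a single warp computing one 16x16x16 tile with FP16 inputs and FP32 accumulation:

```cuda
// Minimal WMMA sketch: one warp computes a 16x16 tile of C = A * B.
// Requires compilation for sm_70 or later (Volta/Turing).
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_16x16x16(const half *a, const half *b, float *c)
{
    // Fragments describe per-warp tiles held in registers.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);              // C starts at zero
    wmma::load_matrix_sync(a_frag, a, 16);          // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag); // Tensor Core multiply-accumulate
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

Launched with a single warp, e.g. wmma_16x16x16<<<1, 32>>>(d_a, d_b, d_c); larger matrices are tiled across many warps, and CUDA Fortran exposes analogous load, multiply-accumulate, and store operations.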
Gone are the days of using a single GPU to train a deep learning model. With computationally intensive algorithms such as semantic segmentation, a single GPU can take days to optimize a model. But multi-GPU hardware is expensive, you say. Not any longer; NVIDIA multi-GPU hardware on cloud instances such as the AWS P3 allows you to pay for only what you use. Cloud instances allow you to take…
Neural networks with thousands of layers and millions of neurons demand high performance and faster training times. The complexity and size of neural networks continue to grow. Mixed-precision training using Tensor Cores on the Volta and Turing architectures enables higher performance while maintaining network accuracy for heavily compute- and memory-intensive Deep Neural Networks (DNNs).
Double-precision floating point (FP64) has been the de facto standard for doing scientific simulation for several decades. Most numerical methods used in engineering and scientific applications require the extra precision to compute correct answers or even reach an answer. However, FP64 also requires more computing resources and runtime to deliver the increased precision levels.
NVIDIA CEO Jensen Huang described the NVIDIA DGX-2 server as "the world's largest GPU" at its launch during the GPU Technology Conference earlier this year. DGX-2 comprises 16 NVIDIA Tesla V100 32 GB GPUs and other top-drawer components (two 24-core Xeon CPUs, 1.5 TB of DDR4 DRAM memory, and 30 TB of NVMe storage) in a single system, delivering two petaFLOPS of performance, qualifying it as one of…
Today at the Computer Vision and Pattern Recognition Conference in Salt Lake City, Utah, NVIDIA is kicking off the conference by demonstrating an early release of Apex, an open-source PyTorch extension that helps users maximize deep learning training performance on NVIDIA Volta GPUs. Inspired by state-of-the-art mixed precision training in translational networks, sentiment analysis…
Today the world of open science received its greatest asset in the form of the Summit supercomputer at Oak Ridge National Laboratory (ORNL). This represents an historic milestone because it is the world's first supercomputer fusing high performance, data-intensive, and AI computing into one system. Summit is capable of delivering a peak 200 petaflops, ten times faster than its Titan predecessor…
Three big NVIDIA Nsight releases on the same day! Nsight Systems is a brand-new optimization tool; Nsight Visual Studio Edition 5.6 extends support to Volta GPUs and Win10 RS4; and Nsight Graphics 1.2 replaces the current Linux Graphics Debugger. NVIDIA Nsight Systems is a low-overhead performance analysis tool designed to provide the insights developers need to optimize their software.
Researchers at NVIDIA open-sourced v0.2 of OpenSeq2Seq, a new toolkit built on top of TensorFlow for training sequence-to-sequence models. OpenSeq2Seq provides researchers with optimized implementations of various sequence-to-sequence models commonly used for applications such as machine translation and speech recognition. OpenSeq2Seq is performance-optimized for mixed-precision training using…
NVIDIA Nsight Visual Studio Edition 5.5 is now available for download in the NVIDIA Registered Developer Program. This release extends support to the latest Volta GPUs and Win10 RS3. The Graphics Debugger adds Pixel History (DirectX 11, OpenGL) and OpenVR 1.0.10 support as well as Vulkan and Range Profiler improvements. Nsight Visual Studio Edition version 5.5 also introduces new compute tools…
NVIDIA TensorRT is a high-performance deep learning inference optimizer and runtime that delivers low-latency, high-throughput inference for deep learning applications. NVIDIA released TensorRT last year with the goal of accelerating deep learning inference for production deployment. In this post we'll introduce TensorRT 3, which improves performance versus previous versions and includes new…
Many of today's applications process large volumes of data. While GPU architectures have very fast HBM or GDDR memory, they have limited capacity. Making the most of GPU performance requires the data to be as close to the GPU as possible. This is especially important for applications that iterate over the same data multiple times or have a high flops/byte ratio. Many real-world codes have to…
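As a minimal illustration of that data-locality point (my own sketch, not code from the post), the snippet below allocates a buffer with cudaMallocManaged, initializes it on the CPU, and prefetches it to the GPU so that kernels iterating over it repeatedly do not pay for on-demand page migration:

```cuda
#include <cuda_runtime.h>

int main()
{
    const size_t n = 1 << 26;                        // ~64M floats (~256 MB)
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));     // one pointer, visible to CPU and GPU

    for (size_t i = 0; i < n; ++i)                   // first touch on the CPU
        data[i] = 1.0f;

    int device = 0;
    cudaGetDevice(&device);
    // Migrate the pages to the GPU before launching work so the kernel
    // does not stall on page faults during its first pass over the data.
    cudaMemPrefetchAsync(data, n * sizeof(float), device, 0);

    // ... launch kernels that iterate over `data` here ...

    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}
```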
A defining feature of the new NVIDIA Volta GPU architecture is Tensor Cores, which give the NVIDIA V100 accelerator a peak throughput that is 12x the 32-bit floating point throughput of the previous-generation NVIDIA P100. Tensor Cores enable you to use mixed-precision for higher throughput without sacrificing accuracy. Tensor Cores provide a huge boost to convolutions and matrix operations.
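One low-effort way to reach the Tensor Cores for matrix operations is through cuBLAS rather than hand-written kernels; the sketch below (an illustrative example, not code from the post) asks cuBLAS for Tensor Core math on an FP16-input, FP32-accumulate GEMM:

```cuda
#include <cublas_v2.h>
#include <cuda_fp16.h>

// C = A * B with FP16 inputs and FP32 accumulation on Tensor Cores.
// A, B, C are m x m, column-major (cuBLAS convention), already on the device.
void tensor_core_gemm(cublasHandle_t handle, int m,
                      const __half *d_a, const __half *d_b, float *d_c)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);   // allow Tensor Core kernels

    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, m, m,
                 &alpha,
                 d_a, CUDA_R_16F, m,
                 d_b, CUDA_R_16F, m,
                 &beta,
                 d_c, CUDA_R_32F, m,
                 CUDA_R_32F,                            // accumulate in FP32
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);        // prefer a Tensor Core algorithm
}
```

On Volta-era cuBLAS, the matrix dimensions generally need to be multiples of 8 for Tensor Core kernels to be selected; otherwise the library falls back to regular FP16/FP32 paths.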
Deep Neural Networks (DNNs) have led to breakthroughs in a number of areas, including image processing and understanding, language modeling, language translation, speech processing, game playing, and many others. DNN complexity has been increasing to achieve these results, which in turn has increased the computational resources required to train these networks. Mixed-precision training lowers the…
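The general recipe behind that excerpt keeps an FP32 master copy of the weights, runs forward and backward in FP16 with the loss multiplied by a scale factor, then unscales the gradients and updates the master weights in FP32. A minimal sketch of that update step, with hypothetical names and a plain SGD rule (an illustration of the technique, not code from the post), might look like this:

```cuda
#include <cuda_fp16.h>

// One mixed-precision SGD step: unscale the FP16 gradients, update the FP32
// master weights, and refresh the FP16 weights used for the next forward pass.
__global__ void sgd_mixed_precision_step(float *master_w, __half *w,
                                         const __half *grad, int n,
                                         float lr, float loss_scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float g = __half2float(grad[i]) / loss_scale;  // undo the loss scaling
        master_w[i] -= lr * g;                         // weight update in FP32
        w[i] = __float2half(master_w[i]);              // FP16 copy for the model
    }
}
```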
Modern deep neural networks, such as those used in self-driving vehicles, require a mind-boggling amount of computational power. Today a single computer, like NVIDIA DGX-1, can achieve computational performance on par with the world's biggest supercomputers in the year 2010 ("Top 500", 2010). Even though this technological advance is unprecedented, it is being dwarfed by the computational hunger…
Previously known as CNTK, the Microsoft Cognitive Toolkit version 2.0 allows developers to create, train, and evaluate their own neural networks that can scale across multiple GPUs and multiple machines on massive data sets. The open-source toolkit, available on GitHub, offers hundreds of new features, performance improvements and fixes that have been added since the beta version of CNTK.