A team of Stanford University researchers has developed an LLM system to cut through bureaucratic red tape. The LLM, dubbed the System for Statutory Research, or STARA, can help policymakers quickly and cheaply parse voluminous collections of rules to identify laws that are redundant, outdated, or overly onerous. Ultimately, it can make governments more efficient, the researchers say.
Modern high-performance computing (HPC) is enabling more than just quick calculations; it's powering AI systems that are unlocking scientific breakthroughs. HPC has gone through many iterations, each sparked by a creative repurposing of technologies. For example, early supercomputers used off-the-shelf components. Researchers later built powerful clusters from personal computers and even…
Conservationists have launched a new AI tool that can sift through petabytes of underwater imaging from anywhere in the world to identify signs of abandoned or lost fishing nets, so-called ghost nets. Each year, around 2% of the world's fishing gear, including roughly 80,000 square kilometers of fishing nets, is lost in the oceans. Those nets threaten marine wildlife like seals, turtles…
With the growth of large language models (LLMs), deep learning is advancing both model architecture design and computational efficiency. Mixed precision training, which strategically employs lower precision formats like brain floating point 16 (BF16) for computationally intensive operations while retaining the stability of 32-bit floating-point (FP32) where needed, has been a key strategy for…
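As a rough sketch of that strategy (not the post's exact recipe; the model, data, and hyperparameters below are placeholders), a PyTorch-style training step can run the compute-heavy forward and loss in BF16 under autocast while the weights and optimizer update stay in FP32:

```python
import torch
from torch import nn

# Minimal BF16/FP32 mixed precision sketch; requires a GPU with BF16 support.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # weights remain FP32
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 1024, device="cuda")
y = torch.randint(0, 10, (32,), device="cuda")

for _ in range(10):
    optimizer.zero_grad(set_to_none=True)
    # Compute-intensive ops run in BF16 inside autocast; reductions and the
    # optimizer step stay in FP32 for stability.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```

Because BF16 keeps FP32's exponent range, no loss scaling is shown here; FP16-based recipes typically add a gradient scaler.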
Generative AI has revolutionized how people bring ideas to life, and agentic AI represents the next leap forward in this technological evolution. By leveraging sophisticated, autonomous reasoning and iterative planning, AI agents can tackle complex, multistep problems with remarkable efficiency. As AI continues to revolutionize industries, the demand for running AI models locally has surged.
Generative AI is unlocking new computing applications that greatly augment human capability, enabled by continued model innovation. Generative AI models, including large language models (LLMs), are used for crafting marketing copy, writing computer code, rendering detailed images, composing music, generating videos, and more. The amount of compute required by the latest models is immense and…
The most exciting computing applications currently rely on training and running inference on complex AI models, often in demanding, real-time deployment scenarios. High-performance, accelerated AI platforms are needed to meet the demands of these applications and deliver the best user experiences. New AI models are constantly being invented to enable new capabilities…
Today during the 2022 NVIDIA GTC Keynote address, NVIDIA CEO Jensen Huang introduced the new NVIDIA H100 Tensor Core GPU based on the new NVIDIA Hopper GPU architecture. This post gives you a look inside the new H100 GPU and describes important new features of NVIDIA Hopper architecture GPUs. The NVIDIA H100 Tensor Core GPU is our ninth-generation data center GPU designed to deliver an…
Today, NVIDIA is enabling developers to explore and evaluate experimental AI models for Deep Learning Super Sampling (DLSS). Developers can download experimental dynamic-link libraries (DLLs), test how the latest DLSS research enhances their games, and provide feedback for future improvements. NVIDIA DLSS is a deep learning neural network that boosts frame rates and generates beautiful…
This post was updated July 20, 2021 to reflect NVIDIA TensorRT 8.0 updates. Join the NVIDIA Triton and NVIDIA TensorRT community to stay current on the latest product updates, bug fixes, content, best practices, and more. When deploying a neural network, it's useful to think about how the network could be made to run faster or take less space. A more efficient network can make better…
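As one hedged illustration of that optimization step, here is a minimal sketch using the TensorRT Python API that parses an ONNX model and builds an FP16-enabled engine; the file names and the choice of FP16 are assumptions for the example, not details from the post:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# Explicit-batch flag as used with the TensorRT 8.x ONNX parser.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:          # placeholder model path
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)        # allow reduced-precision kernels

engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)                    # serialized engine for deployment
```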
DLSS is a deep learning, super-resolution network that boosts frame rates by rendering fewer pixels and then using AI to construct sharp, higher-resolution images. Dedicated computational units on NVIDIA RTX GPUs called Tensor Cores accelerate the AI calculations, allowing the algorithm to run in real time. DLSS pairs perfectly with computationally intensive rendering algorithms such as real-time…
NVIDIA Ampere GPU architecture introduced the third generation of Tensor Cores, with the new TensorFloat32 (TF32) mode for accelerating FP32 convolutions and matrix multiplications. TF32 mode is the default option for AI training with 32-bit variables on Ampere GPU architecture. It brings Tensor Core acceleration to single-precision DL workloads, without needing any changes to model scripts.
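For a concrete sense of "no changes to model scripts", a small sketch using PyTorch's TF32 switches (the flag names are PyTorch's own, and whether they default to on depends on the framework version):

```python
import torch

# FP32 tensors and ordinary code; only the backend math mode changes.
torch.backends.cuda.matmul.allow_tf32 = True   # FP32 matmuls may run as TF32 on Tensor Cores
torch.backends.cudnn.allow_tf32 = True         # cuDNN convolutions may run as TF32

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
c = a @ b                                      # eligible for TF32 Tensor Core execution on Ampere
```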
Tuned math libraries are an easy and dependable way to extract the ultimate performance from your HPC system. However, for long-lived applications or those that need to run on a variety of platforms, adapting library calls for each vendor or library version can be a maintenance nightmare. A compiler that can automatically generate calls to tuned math libraries gives you the best of both…
The NVIDIA A100, based on the NVIDIA Ampere GPU architecture, offers a suite of exciting new features: third-generation Tensor Cores, Multi-Instance GPU (MIG), and third-generation NVLink. Ampere Tensor Cores introduce a novel math mode dedicated to AI training: TensorFloat-32 (TF32). TF32 is designed to accelerate the processing of FP32 data types, commonly used in DL workloads.
Organizations of all kinds are incorporating AI into their research, development, product, and business processes. This helps them meet and exceed their particular goals, and also helps them gain experience and knowledge to take on even bigger challenges. However, traditional compute infrastructures aren't suitable for AI due to slow CPU architectures and varying system requirements for different…
Today, during the 2020 NVIDIA GTC keynote address, NVIDIA founder and CEO Jensen Huang introduced the new NVIDIA A100 GPU based on the new NVIDIA Ampere GPU architecture. This post gives you a look inside the new A100 GPU, and describes important new features of NVIDIA Ampere architecture GPUs. The diversity of compute-intensive applications running in modern cloud data centers has driven…
Medical image segmentation is a hot topic in the deep learning community. Proof of that is the number of challenges, competitions, and research projects being conducted in this area, which only rises year over year. Among all the different approaches to this problem, U-Net has become the backbone of many of the top-performing solutions for both 2D and 3D segmentation tasks. This is due to its…
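For orientation only, here is a deliberately tiny U-Net-style encoder-decoder with a single skip connection; it sketches the pattern, not the configuration used by the top-performing solutions:

```python
import torch
from torch import nn

class TinyUNet(nn.Module):
    """Minimal U-Net-style model: downsample, upsample, and one skip connection."""
    def __init__(self, in_ch=1, out_ch=1):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        # The decoder sees upsampled features concatenated with the encoder skip.
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, out_ch, 1))

    def forward(self, x):
        skip = self.enc(x)                  # high-resolution features
        mid = self.mid(self.down(skip))     # low-resolution bottleneck
        up = self.up(mid)                   # back to the skip's resolution
        return self.dec(torch.cat([up, skip], dim=1))

logits = TinyUNet()(torch.randn(1, 1, 64, 64))   # (1, 1, 64, 64) segmentation logits
```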
As more and more deep learning models are being deployed into production environments, there is a growing need for a separation between the work on the model itself, and the work of integrating it into a production pipeline. Windows ML caters to this demand by addressing efficient deployment of pretrained deep learning models into Windows applications. Developing and training the model itself…
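Windows ML consumes ONNX models, so the hand-off from the model-development side is typically an ONNX export. A minimal sketch, assuming a placeholder model and file name:

```python
import torch
from torch import nn

# Placeholder network; in practice this would be your trained model.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.Linear(8, 10)).eval()
dummy = torch.randn(1, 3, 224, 224)   # example input that defines the exported graph

# Export to ONNX, the format Windows ML loads; file name and opset are illustrative.
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["logits"],
                  opset_version=12)
```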
Every year, clever researchers introduce ever more complex and interesting deep learning models to the world. There is of course a big difference between a model that works as a nice demo in isolation and a model that performs a function within a production pipeline. This is particularly pertinent to creative apps where generative models must run with low latency to generate or enhance image…
Sign up for the latest Speech AI News from NVIDIA. This post, intended for developers with a professional-level understanding of deep learning, will help you produce a production-ready AI text-to-speech model. Converting text into high-quality, natural-sounding speech in real time has been a challenging conversational AI task for decades. State-of-the-art speech synthesis models are based on…
Large scale language models (LSLMs) such as BERT, GPT-2, and XL-Net have brought about exciting leaps in state-of-the-art accuracy for many natural language understanding (NLU) tasks. Since its release in October 2018, BERT (Bidirectional Encoder Representations from Transformers) remains one of the most popular language models and still delivers state-of-the-art accuracy at the time of writing.
Object detection remains the primary driver for applications such as autonomous driving and intelligent video analytics. Object detection applications require substantial training using vast datasets to achieve high levels of accuracy. NVIDIA GPUs excel at the parallel compute performance required to train the large networks used for object detection inference.
Our most popular question is "What can I do to get great GPU performance for deep learning?" We've recently published a detailed Deep Learning Performance Guide to help answer this question. The guide explains how GPUs process data and gives tips on how to design networks for better performance. We also take a close look at Tensor Core optimization to help improve performance. This post takes a…
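One example of the kind of tip covered under Tensor Core optimization is dimension alignment: FP16 matrix dimensions that are multiples of 8 map cleanly onto Tensor Core tiles. A hedged sketch, with sizes chosen only to illustrate the rule:

```python
import torch
from torch import nn

# Illustrative only: channel counts and batch size are multiples of 8 so the
# underlying FP16 GEMM aligns with Tensor Core tile sizes.
layer = nn.Linear(in_features=1024, out_features=4096).half().cuda()
x = torch.randn(256, 1024, device="cuda", dtype=torch.float16)   # batch of 256
y = layer(x)   # FP16 GEMM with every dimension a multiple of 8
```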
Machine learning harnesses computing power to solve a variety of "hard" problems that seemed impossible to program using traditional languages and techniques. Machine learning avoids the need for a programmer to explicitly program the steps in solving a complex pattern-matching problem such as understanding speech or recognizing objects within an image. NVIDIA aims to bring machine learning to…
The CUDA Fortran compiler from PGI now supports programming Tensor Cores with NVIDIA's Volta V100 and Turing GPUs. This enables scientific programmers using Fortran to take advantage of FP16 matrix operations accelerated by Tensor Cores. Let's take a look at how Fortran supports Tensor Cores. Tensor Cores offer substantial performance gains over typical CUDA GPU core programming on Tesla V100…
Whether to employ mixed precision to train your TensorFlow models is no longer a tough decision. NVIDIA's Automatic Mixed Precision (AMP) feature for TensorFlow, recently announced at GTC 2019, enables mixed precision training by making all the required model and optimizer adjustments internally within TensorFlow with minimal programmer intervention.
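The AMP feature described here is a graph rewrite inside TensorFlow; for readers on current TensorFlow, the Keras mixed precision policy expresses the same idea. A sketch (not the original GTC 2019 API), with a placeholder model:

```python
import tensorflow as tf

# Global policy: compute in float16, keep variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1024,)),
    tf.keras.layers.Dense(1024, activation="relu"),
    # Keep the final softmax in float32 for numerical stability.
    tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
])
# Under the mixed_float16 policy, Keras applies loss scaling for you when
# compiling with a string optimizer and training via fit().
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```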
Neural networks with thousands of layers and millions of neurons demand high performance and faster training times. The complexity and size of neural networks continue to grow. Mixed-precision training using Tensor Cores on Volta and Turing architectures enables higher performance while maintaining network accuracy for heavily compute- and memory-intensive Deep Neural Networks (DNNs).
Double-precision floating point (FP64) has been the de facto standard for doing scientific simulation for several decades. Most numerical methods used in engineering and scientific applications require the extra precision to compute correct answers or even reach an answer. However, FP64 also requires more computing resources and runtime to deliver the increased precision levels.
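A toy NumPy comparison of the same accumulation in FP32 and FP64 shows the precision gap that drives this trade-off (illustrative only; the specific values are placeholders):

```python
import numpy as np

# The same summation carried out at two precisions.
x32 = np.full(10_000_000, 0.1, dtype=np.float32)
x64 = x32.astype(np.float64)

print(x32.sum())   # FP32 result loses digits to representation and rounding error
print(x64.sum())   # FP64 retains several more significant digits of the same sum
```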
Fueled by the ongoing growth of the gaming market and its insatiable demand for better 3D graphics, NVIDIA has evolved the GPU into the world's leading parallel processing engine for many computationally intensive applications. In addition to rendering highly realistic and immersive 3D games, NVIDIA GPUs also accelerate content creation workflows, high performance computing (HPC) and datacenter…
Mixed precision combines different numerical precisions in a computational method. Using precision lower than FP32 reduces memory usage, allowing deployment of larger neural networks. Data transfers take less time, and compute performance increases, especially on NVIDIA GPUs with Tensor Core support for that precision. Mixed-precision training of DNNs achieves two main objectives: This video…
Neural network models have quickly taken advantage of NVIDIA Tensor Cores for deep learning since their introduction in the Tesla V100 GPU last year. For example, new performance records for ResNet50 training were announced recently with Tensor Core-based solutions. (See the NVIDIA developer post on new performance milestones for additional details). NVIDIA's cuDNN library enables CUDA…
NVIDIA has released TensorRT 4 at CVPR 2018. This new version of TensorRT, NVIDIA's powerful inference optimizer and runtime engine, provides: Additional features include the ability to execute custom neural network layers using FP16 precision and support for the Xavier SoC through NVIDIA DRIVE AI platforms. TensorRT 4 speeds up deep learning inference applications such as neural machine…
Artificial intelligence powered by deep learning now solves challenges once thought impossible, such as computers understanding and conversing in natural speech and autonomous driving. Inspired by the effectiveness of deep learning to solve a great many challenges, the exponentially growing complexity of algorithms has resulted in a voracious appetite for faster computing.
A defining feature of the new NVIDIA Volta GPU architecture is Tensor Cores, which give the NVIDIA V100 accelerator a peak throughput that is 12x the 32-bit floating point throughput of the previous-generation NVIDIA P100. Tensor Cores enable you to use mixed-precision for higher throughput without sacrificing accuracy. Tensor Cores provide a huge boost to convolutions and matrix operations.
Deep Neural Networks (DNNs) have led to breakthroughs in a number of areas, including image processing and understanding, language modeling, language translation, speech processing, game playing, and many others. DNN complexity has been increasing to achieve these results, which in turn has increased the computational resources required to train these networks. Mixed-precision training lowers the…
At the 2017 GPU Technology Conference NVIDIA announced CUDA 9, the latest version of CUDA's powerful parallel computing platform and programming model. The CUDA Toolkit version 9.0 is now available as a free download. In this post I'll provide an overview of the awesome new features of CUDA 9. To learn more you can watch the recording of my talk from GTC…