The NVIDIA CUDA-X math libraries empower developers to build accelerated applications for AI, scientific computing, data processing, and more. Two of the most important applications of CUDA-X libraries are LLM training and inference, whether for use in everyday consumer applications or in highly specialized scientific domains like drug discovery. Multiple CUDA-X libraries are indispensable…
NVIDIA is excited to collaborate with Colfax, Together.ai, Meta, and Princeton University on their recent achievement exploiting the Hopper GPU architecture and Tensor Cores to accelerate key fused attention kernels using CUTLASS 3. FlashAttention-3 incorporates key techniques to achieve 1.5–2.0x faster performance than FlashAttention-2 with FP16, up to 740 TFLOPS. With FP8…
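For context, the attention operation that FlashAttention fuses into a single kernel is, mathematically, a softmax-weighted matrix product. Below is a minimal, unfused reference sketch in PyTorch; the tensor shapes and names are illustrative assumptions, not taken from the post.

```python
import math
import torch

def reference_attention(q, k, v):
    """Unfused scaled dot-product attention: softmax(QK^T / sqrt(d)) V.

    FlashAttention computes the same result in one fused kernel,
    avoiding materializing the full (seq_len x seq_len) score matrix
    in GPU memory. q, k, v: (batch, heads, seq_len, head_dim).
    """
    scale = 1.0 / math.sqrt(q.shape[-1])
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale  # (B, H, S, S)
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, v)                        # (B, H, S, head_dim)

# Requires a CUDA device; swap to device="cpu" and float32 otherwise.
q = k = v = torch.randn(1, 8, 1024, 64, dtype=torch.float16, device="cuda")
out = reference_attention(q, k, v)
```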
The latest release of the NVIDIA cuBLAS library, version 12.5, continues to deliver functionality and performance to deep learning (DL) and high-performance computing (HPC) workloads. This post provides an overview of the following updates to cuBLAS matrix multiplications (matmuls) since version 12.0, along with a walkthrough: Grouped GEMM APIs can be viewed as a generalization of the batched…
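To illustrate the distinction the post draws: batched GEMM requires every problem in the batch to share one shape, while grouped GEMM relaxes this so each group can have its own M, N, and K. The Python sketch below shows only the semantics, assuming arbitrary example shapes; it is not the cuBLAS API itself.

```python
import torch

# Batched GEMM: one shape for all problems, one fused call.
A = torch.randn(16, 128, 64)
B = torch.randn(16, 64, 256)
C = torch.bmm(A, B)  # 16 GEMMs, each 128x64 @ 64x256

# Grouped GEMM semantics: each group may have different dimensions.
# cuBLAS exposes this as a single call; this loop is only a reference
# for what that call computes.
shapes = [(128, 64, 256), (32, 512, 16), (700, 20, 40)]  # (M, K, N) per group
outputs = []
for m, k, n in shapes:
    a = torch.randn(m, k)
    b = torch.randn(k, n)
    outputs.append(a @ b)
```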
The past decade has seen a remarkable surge in the adoption of deep learning techniques for computer vision (CV) tasks. Convolutional neural networks (CNNs) have been the cornerstone of this revolution, exhibiting exceptional performance and enabling significant advancements in visual perception. By employing localized filters and hierarchical architectures, CNNs have proven adept at…
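As a concrete illustration of "localized filters and hierarchical architectures," here is a minimal PyTorch CNN sketch; the layer sizes and class count are arbitrary assumptions chosen only to show the pattern.

```python
import torch
import torch.nn as nn

# Each Conv2d applies small localized filters (3x3) across the image;
# stacking conv + downsampling stages builds a feature hierarchy, from
# edges in early layers to object parts in later ones.
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 56 * 56, 10),  # assumes 224x224 input and 10 classes
)
logits = model(torch.randn(1, 3, 224, 224))
```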
Learn how transformers are used as the building blocks of modern large language models in this new self-paced course.
Stacking transformer layers to create large models results in better accuracies, few-shot learning capabilities, and even near-human emergent abilities on a wide range of language tasks. These foundation models are expensive to train, and they can be memory- and compute-intensive during inference (a recurring cost). The most popular large language models (LLMs) today can reach tens to hundreds of…
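"Stacking transformer layers" is quite literal: a large model is many identical blocks composed in sequence. A minimal sketch using PyTorch's built-in encoder layer, with hyperparameters that are illustrative assumptions (real LLMs scale d_model and the layer count far higher):

```python
import torch
import torch.nn as nn

# Stack 24 identical transformer blocks; the memory and compute costs
# of training and inference grow with this depth and width.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
model = nn.TransformerEncoder(layer, num_layers=24)

tokens = torch.randn(1, 128, 512)  # (batch, seq_len, d_model)
hidden = model(tokens)

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")
```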
Interested in developing LLM-based applications? Get started with this exploration of the open-source ecosystem.
Explore new generative AI models from NVIDIA that will have a major impact on your vision AI developer stack.
Vision Transformers (ViTs) are taking computer vision by storm, offering incredible accuracy, robust solutions for challenging real-world scenarios, and improved generalizability. These algorithms are playing a pivotal role in boosting computer vision applications, and NVIDIA is making it easy to integrate ViTs into your applications using NVIDIA TAO Toolkit and NVIDIA L4 GPUs.
Learn how Vision Transformers are revolutionizing AI applications with image understanding and analysis.
Recent years have seen a proliferation of large language models (LLMs) that extend beyond traditional language tasks to generative AI. This includes models like ChatGPT and Stable Diffusion. As this generative AI focus continues to grow, there is a rising need for a modern machine learning (ML) infrastructure that makes scalability accessible to the everyday practitioner.
The NVIDIA H100 Tensor Core GPU, based on the NVIDIA Hopper architecture with the fourth generation of NVIDIA Tensor Cores, recently debuted, delivering unprecedented performance and sweeping AI benchmarks such as MLPerf training. A significant fraction of operations in AI and machine learning benchmarks are general matrix multiplications (GEMMs), which are also referred to as matmul…
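In BLAS terms, a GEMM computes C = alpha * A @ B + beta * C. A small reference sketch of the operation the Tensor Cores accelerate, with shapes and scalars as arbitrary example values:

```python
import torch

# General matrix multiplication (GEMM): C = alpha * A @ B + beta * C.
# On H100, Tensor Cores execute this for reduced-precision inputs
# such as FP16/BF16. Requires a CUDA device; use device="cpu" and
# float32 to run this sketch without a GPU.
alpha, beta = 1.0, 0.5
A = torch.randn(1024, 512, dtype=torch.float16, device="cuda")
B = torch.randn(512, 2048, dtype=torch.float16, device="cuda")
C = torch.randn(1024, 2048, dtype=torch.float16, device="cuda")
C = alpha * (A @ B) + beta * C
```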
AI processing requires full-stack innovation across hardware and software platforms to address the growing computational demands of neural networks. A key area to drive efficiency is using lower precision number formats to improve computational efficiency, reduce memory usage, and optimize for interconnect bandwidth. To realize these benefits, the industry has moved from 32-bit precisions to…
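The trade-off the post describes is easy to see by round-tripping values through narrower formats. The sketch below uses FP16 and BF16 as stand-ins (the still-narrower FP8 formats are what the post goes on to introduce):

```python
import torch

x = torch.randn(1_000_000, dtype=torch.float32)

# Casting to narrower formats cuts memory traffic and lets Tensor
# Cores run at higher throughput, at some loss of precision.
for dtype in (torch.float16, torch.bfloat16):
    err = (x - x.to(dtype).to(torch.float32)).abs().max()
    print(f"{dtype}: max round-trip error {err.item():.2e}")
```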
Automatic speech recognition (ASR) research generally focuses on high-resource languages such as English, which is supported by hundreds of thousands of hours of speech. Recent literature has renewed focus on more complex languages, such as Japanese. Like other Asian languages, Japanese has a vast base character set (upwards of 3,000 unique characters are used in common vernacular)…
This is the first part of a two-part series discussing the NVIDIA Triton Inference Server's FasterTransformer (FT) library, one of the fastest libraries for distributed inference of transformers of any size (up to trillions of parameters). It provides an overview of FasterTransformer, including the benefits of using the library. Join the NVIDIA Triton and NVIDIA TensorRT community to stay…
This is the second part of a two-part series about NVIDIA tools that allow you to run large transformer models for accelerated inference. For an introduction to the FasterTransformer library (Part 1), see Accelerated Inference for Large Transformer Models Using NVIDIA Triton Inference Server. Join the NVIDIA Triton and NVIDIA TensorRT community to stay current on the latest product updates…
As the size and complexity of large language models (LLMs) continue to grow, NVIDIA is today announcing updates to the NeMo framework that provide training speed-ups of up to 30%. These updates, which include two trailblazing techniques and a hyperparameter tool to optimize and scale training of LLMs on any number of GPUs, offer new capabilities to train and deploy models using the NVIDIA AI…
Natural language processing (NLP) can be defined as the combination of artificial intelligence (AI), computer science, and computational linguistics to understand human communication and extract meaning from unstructured spoken or written material. NLP use cases for healthcare have increased in the last few years to accelerate the development of therapeutics and improve quality of patient…
With the increasing demand for access to pretrained large language model (LLM) weights, the climate around LLM sharing is changing. Recently, Meta released Open Pretrained Transformer, a language model with 175 billion parameters. BigScience is on schedule to release its multilingual language model with 176 billion parameters in a few months. As more LLMs become available…
At the Computer Vision and Pattern Recognition Conference (CVPR), NVIDIA researchers are presenting over 35 papers. This includes work on Shifted WINdows UNEt TRansformers (Swin UNETR), the first transformer-based pretraining framework tailored for self-supervised tasks in 3D medical image analysis. The research is the first step in creating pretrained, large-scale, and self-supervised 3D models…
Computer vision is a rapidly growing field in research and applications. Advances in computer vision research are now more directly and immediately applicable to the commercial world. AI developers are implementing computer vision solutions that identify and classify objects and even react to them in real time. Image classification, face detection, pose estimation, and optical flow are some…
Big data, new algorithms, and fast computation are three main factors that make the modern AI revolution possible. However, data poses many challenges for enterprises: difficulty in data labeling, ineffective data governance, limited data availability, data privacy, and so on. Synthetically generated data is a potential solution to address these challenges because it generates data points by…
Watch NVIDIA founder and CEO Jensen Huang's GTC keynote address streaming on Nov. 9 and in replay. Tune in to a healthcare special address by Kimberly Powell, NVIDIA VP of healthcare, on Nov. 9 at 10:30 a.m. Pacific. Subscribe to NVIDIA healthcare news. NVIDIA Clara Holoscan is the AI computing platform for medical devices that combines hardware systems for low-latency sensor and network…
Predictive maintenance is used for early fault detection, diagnosis, and prediction of when maintenance is needed in various industries, including oil and gas, manufacturing, and transportation. Equipment is continuously monitored to measure things like sound, vibration, and temperature to alert and report potential issues. To accomplish this in computers, the first step is to determine the root cause…
MLPerf is an industry-wide AI consortium that has developed a suite of performance benchmarks covering a range of leading AI workloads that are widely in use today. The latest MLPerf v0.7 training submission includes vision, language, recommenders, and reinforcement learning. NVIDIA submitted MLPerf v0.7 training results for all eight tests, and the NVIDIA platform set records in all…