In high-stakes fields such as quant finance, algorithmic trading, and fraud detection, data practitioners frequently need to process hundreds of gigabytes (GB) of data to make quick, informed decisions. Polars, one of the fastest-growing data processing libraries, meets this need with a GPU engine powered by NVIDIA cuDF that accelerates compute-bound queries that are common in these fields.
NVIDIA cuPyNumeric is a library that aims to provide a distributed and accelerated drop-in replacement for NumPy built on top of the Legate framework. It brings zero-code-change scaling to multi-GPU and multinode (MGMN) accelerated computing. cuPyNumeric 25.03 is a milestone update that introduces powerful new capabilities and enhanced accessibility for users and developers alike…
With the rise of physical AI, video content generation has surged exponentially. A single camera-equipped autonomous vehicle can generate more than 1 TB of video daily, while a robotics-powered manufacturing facility may produce 1 PB of data daily. To leverage this data for training and fine-tuning world foundation models (WFMs), you must first process it efficiently.
Classifier models specialize in categorizing data into predefined groups or classes and play a crucial role in optimizing data processing pipelines for fine-tuning and pretraining generative AI models. Their value lies in enhancing data quality by filtering out low-quality or toxic data, ensuring that only clean and relevant information feeds downstream processes. Beyond filtering…
Quantum dynamics describes how complex quantum systems evolve in time and interact with their surroundings. Simulating quantum dynamics is extremely difficult yet critical for understanding and predicting the fundamental properties of materials. This is of particular importance in the development of quantum processing units (QPUs), where quantum dynamics simulations enable QPU developers to…
NVIDIA has continually advanced high-performance computing (HPC) by offering its highly optimized High-Performance Conjugate Gradient (HPCG) benchmark as part of the NVIDIA HPC benchmark collection. The NVIDIA HPCG benchmark is now available in the /NVIDIA/nvidia-hpcg GitHub repo, using its high-performance math libraries, cuSPARSE…
Large language models (LLMs) are getting larger, increasing the amount of compute required to process inference requests. To meet real-time latency requirements for serving today's LLMs, and to do so for as many users as possible, multi-GPU compute is a must. Low latency improves the user experience. High throughput reduces the cost of service. Both are simultaneously important. Even if a large…
Due to the adoption of multicamera inputs and deep convolutional backbone networks, the GPU memory footprint for training autonomous driving perception models is large. Existing methods for reducing memory usage often result in additional computational overheads or imbalanced workloads. This post describes joint research between NVIDIA and NIO, a developer of smart electric vehicles.
NVIDIA Holoscan for Media is a software-defined platform for building and deploying applications for live media. Recent updates introduce a user-friendly developer interface and new capabilities for application deployment to the platform. Holoscan for Media now includes Helm Dashboard, which delivers an intuitive user interface for orchestrating and managing Helm charts.
This NVIDIA HPC SDK 23.9 update expands platform support and provides minor updates.
The broadcast industry is undergoing a transformation in how content is created, managed, distributed, and consumed. This transformation includes a shift from traditional linear workflows bound by fixed-function devices to flexible and hybrid, software-defined systems that enable the future of live streaming. Developers can now apply to join the early access program for NVIDIA Holoscan for…
As data scientists, we often face the challenging task of training large models on huge datasets. One commonly used tool, XGBoost, is a robust and efficient gradient-boosting framework that's been widely adopted due to its speed and performance for large tabular data. Using multiple GPUs should theoretically provide a significant boost in computational power, resulting in faster model…
Generative AI has become a transformative force of our era, empowering organizations spanning every industry to achieve unparalleled levels of productivity, elevate customer experiences, and deliver superior operational efficiencies. Large language models (LLMs) are the brains behind generative AI. Access to incredibly powerful and knowledgeable foundation models, like Llama and Falcon…
When you see a context-relevant advertisement on a web page, it's most likely content served by a Taboola data pipeline. As the leading content recommendation company in the world, Taboola faced a big challenge: the frequent need to scale Apache Spark CPU cluster capacity to address the constantly growing compute and storage requirements. Data center capacity and hardware costs are always…
This update expands platform support and provides minor updates.
Real-time cloud-scale applications that involve AI-based computer vision are growing rapidly. The use cases include image understanding, content creation, content moderation, mapping, recommender systems, and video conferencing. However, the compute cost of these workloads is growing too, driven by demand for increased sophistication in the processing. The shift from still images to video is…
GPUs continue to get faster with each new generation, and it is often the case that each activity on the GPU (such as a kernel or memory copy) completes very quickly. In the past, each activity had to be separately scheduled (launched) by the CPU, and associated overheads could accumulate to become a performance bottleneck. The CUDA Graphs facility addresses this problem by enabling multiple GPU…
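As a rough illustration of the pattern (a sketch, not the post's code), the example below captures a sequence of short kernel launches into a graph with the CUDA stream-capture API and then replays the whole sequence with one launch per iteration; the `scale` kernel and the sizes are made up for the example.

```cpp
#include <cuda_runtime.h>

// Trivial kernel standing in for short, frequently launched work.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture ten small launches into a graph instead of paying the CPU
    // launch overhead for each of them on every iteration.
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int k = 0; k < 10; ++k) {
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d_x, 1.001f, n);
    }
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graphExec, graph, 0);  // CUDA 12 signature; older
                                                 // toolkits take (&exec, graph,
                                                 // nullptr, nullptr, 0)

    // Replay the entire 10-kernel sequence with a single launch call.
    for (int iter = 0; iter < 1000; ++iter) {
        cudaGraphLaunch(graphExec, stream);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_x);
    return 0;
}
```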
CO2 capture and storage technologies (CCS) catch CO2 from its production source, compress it, transport it through pipelines or by ships, and store it underground. CCS enables industries to massively reduce their CO2 emissions and is a powerful tool to help industrial manufacturers achieve net-zero goals. In many heavy industrial processes, greenhouse gas (GHG) emissions cannot be avoided in the…
Version 23.3 expands platform support and provides minor updates to the NVIDIA HPC SDK.
Version 23.1 of the NVIDIA HPC SDK introduces CUDA 12 support, fixes, and minor enhancements.
Learn the basic concepts, implementations, and applications of graph neural networks (GNNs) in this new self-paced course from NVIDIA Deep Learning Institute.
Learn how to decrease model training time by distributing data to multiple GPUs, while retaining the accuracy of training on a single GPU.
Celebrating the Supercomputing 2022 (SC22) international conference, NVIDIA announces the release of the HPC Software Development Kit (SDK) v22.11. Members of the NVIDIA Developer Program can download the release now for free. The NVIDIA HPC SDK is a comprehensive suite of compilers, libraries, and tools for high-performance computing (HPC) developers. It provides everything developers need to…
Learn how to train the largest of deep neural networks and deploy them to production.
Organizations are rapidly becoming more advanced in the use of AI, and many are looking to leverage the latest technologies to maximize workload performance and efficiency. One of the most prevalent trends today is the use of CPUs based on Arm architecture to build data center servers. To ensure that these new systems are enterprise-ready and optimally configured, NVIDIA has approved the…
This is the second part of a two-part series about NVIDIA tools that allow you to run large transformer models for accelerated inference. For an introduction to the FasterTransformer library (Part 1), see Accelerated Inference for Large Transformer Models Using NVIDIA Triton Inference Server. Join the NVIDIA Triton and NVIDIA TensorRT community to stay current on the latest product updates…
Standard languages have begun adding features that compilers can use for accelerated GPU and CPU parallel programming, for instance, loops and array math intrinsics in Fortran. This is the fourth post in the Standard Parallel Programming series, which aims to instruct developers on the advantages of using parallelism in standard languages for accelerated computing: Using standard…
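As a flavor of the approach (a sketch, not code from the series), here is SAXPY written against the C++17 parallel algorithms only; compiled with nvc++ and the -stdpar=gpu flag the parallel loop can be offloaded to a GPU, while any conforming host compiler runs it as CPU code.

```cpp
#include <algorithm>
#include <execution>
#include <vector>

// SAXPY with no CUDA API calls in the source: the execution policy is the
// only hint that the loop may run in parallel (and, with nvc++ -stdpar=gpu,
// on the GPU).
int main() {
    const std::size_t n = 1 << 20;
    const float a = 2.0f;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(), y.begin(), y.begin(),
                   [a](float xi, float yi) { return a * xi + yi; });

    return 0;
}
```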
High-performance computing (HPC) has become the essential instrument of scientific discovery. Whether it is discovering new, life-saving drugs, battling climate change, or creating accurate simulations of our world, these solutions demand an enormous and rapidly growing amount of processing power. They are increasingly out of reach of traditional computing approaches.
It may seem natural to expect that the performance of your CPU-to-GPU port will fall below that of a dedicated HPC code. After all, you are limited by the constraints of the software architecture, the established API, and the need to account for sophisticated extra features expected by the user base. Not only that, the simplistic programming model of C++ standard parallelism allows for less…
The difficulty of porting an application to GPUs varies from one case to another. In the best-case scenario, you can accelerate critical code sections by calling into an existing GPU-optimized library. This is the case, for example, when the building blocks of your simulation software consist of BLAS linear algebra functions, which can be accelerated using cuBLAS. This is the second post in the…
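For instance, a minimal sketch of replacing a CPU GEMM with a cuBLAS call might look like the following (the matrix sizes and data are placeholders, and error checking is omitted for brevity):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Single-precision GEMM on the GPU: C = alpha*A*B + beta*C.
// cuBLAS expects column-major storage, matching Fortran BLAS conventions.
// Build with: nvcc gemm.cu -lcublas
int main() {
    const int n = 512;  // square matrices for brevity
    std::vector<float> hA(n * n, 1.0f), hB(n * n, 2.0f), hC(n * n, 0.0f);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dB, n * n * sizeof(float));
    cudaMalloc(&dC, n * n * sizeof(float));
    cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(hC.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```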
Today, NVIDIA announces the release of cuFFTMp for Early Access (EA). cuFFTMp is a multi-node, multi-process extension to cuFFT that enables scientists and engineers to solve challenging problems on exascale platforms. FFTs (Fast Fourier Transforms) are widely used in a variety of fields, ranging from molecular dynamics, signal processing, computational fluid dynamics (CFD) to wireless…
Single-cell genomics research continues to advance drug discovery for disease prevention. For example, it has been pivotal in developing treatments for the current COVID-19 pandemic, identifying cells susceptible to infection, and revealing changes in the immune systems of infected patients. However, with the growing availability of large-scale single-cell datasets, it's clear that computing…
Many GPU-accelerated HPC applications spend a substantial portion of their time in non-uniform, GPU-to-GPU communications. Additionally, in many HPC systems, different GPU pairs share communication links with varying bandwidth and latency. As a result, GPU assignment can substantially impact time to solution. Furthermore, on multi-node/multi-socket systems, communication performance can degrade…
This is the second post in the Accelerating IO series, which describes the architecture, components, and benefits of Magnum IO, the IO subsystem of the modern data center. The first post in this series introduced the Magnum IO architecture and positioned it in the broader context of CUDA, CUDA-X, and vertical application domains. Of the four major components of the architecture…
This is the first post in the Accelerating IO series, which describes the architecture, components, storage, and benefits of Magnum IO, the IO subsystem of the modern data center. Previously the boundary of the unit of computing, sheet metal no longer constrains the resources that can be applied to a single problem or the data set that can be housed. The new unit is the data center.
With the introduction of hardware-accelerated ray tracing in NVIDIA Turing GPUs, game developers have received an opportunity to significantly improve the realism of gameplay rendered in real time. In these early days of real-time ray tracing, most implementations are limited to one or two ray-tracing effects, such as shadows or reflections. If the game's geometry and materials are simple enough…
Deep neural network (DNN) development for self-driving cars is a demanding workload. In this post, we validate DGX multi-node, multi-GPU, distributed training running on RedHat OpenShift in the DXC Robotic Drive environment. We used OpenShift 3.11, also a part of the Robotic Drive containerized compute platform, to orchestrate and execute the deep learning (DL) workloads.
Gone are the days of using a single GPU to train a deep learning model. With computationally intensive algorithms such as semantic segmentation, a single GPU can take days to optimize a model. But multi-GPU hardware is expensive, you say. Not any longer; NVIDIA multi-GPU hardware on cloud instances like the AWS P3 allows you to pay for only what you use. Cloud instances allow you to take…
Today many servers contain 8 or more GPUs. In principle then, scaling an application from one to many GPUs should provide a tremendous performance boost. But in practice, this benefit can be difficult to obtain. There are two common culprits behind poor multi-GPU scaling. The first is that enough parallelism has not been exposed to efficiently saturate the processors. The second reason for poor…
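A minimal sketch of the first point, exposing enough parallelism by giving every visible GPU its own chunk of work and its own stream, might look like this (the `scale` kernel and sizes are illustrative):

```cpp
#include <cuda_runtime.h>
#include <vector>

__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Give each GPU its own buffer and stream, and launch work on every device
// before synchronizing, so all GPUs compute at the same time.
int main() {
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);

    const int n_per_gpu = 1 << 24;
    std::vector<float*> dbuf(ngpus);
    std::vector<cudaStream_t> streams(ngpus);

    for (int g = 0; g < ngpus; ++g) {
        cudaSetDevice(g);  // subsequent calls target GPU g
        cudaMalloc(&dbuf[g], n_per_gpu * sizeof(float));
        cudaStreamCreate(&streams[g]);
        scale<<<(n_per_gpu + 255) / 256, 256, 0, streams[g]>>>(dbuf[g], 2.0f, n_per_gpu);
    }

    // Only now wait; the launches above were asynchronous, so the GPUs ran concurrently.
    for (int g = 0; g < ngpus; ++g) {
        cudaSetDevice(g);
        cudaStreamSynchronize(streams[g]);
        cudaStreamDestroy(streams[g]);
        cudaFree(dbuf[g]);
    }
    return 0;
}
```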
NVIDIA GPUDirect RDMA is a technology which enables a direct path for data exchange between the GPU and third-party peer devices using standard features of PCI Express. Examples of third-party devices include network interfaces, video acquisition devices, storage adapters, and medical equipment. Enabled on Tesla and Quadro-class GPUs, GPUDirect RDMA relies on the ability of NVIDIA GPUs to expose…
Some of the fastest computers in the world are cluster computers. A cluster is a computer system comprising two or more computers ("nodes") connected with a high-speed network. Cluster computers can achieve higher availability, reliability, and scalability than is possible with an individual computer. With the increasing adoption of GPUs in high-performance computing (HPC)…
I introduced CUDA-aware MPI in my last post, with an introduction to MPI and a description of the functionality and benefits of CUDA-aware MPI. In this post I will demonstrate the performance of CUDA-aware MPI through both synthetic and realistic benchmarks. Since you now know why CUDA-aware MPI is more efficient from a theoretical perspective, let's take a look at the results of MPI bandwidth and…
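A synthetic bandwidth test of this kind can be sketched as below (not the post's benchmark code): rank 0 repeatedly sends a device buffer to rank 1 and divides the bytes moved by the elapsed time. A CUDA-aware MPI build is assumed, so the cudaMalloc'd pointer is passed straight to MPI_Send and MPI_Recv; run with two ranks (for example, mpirun -np 2).

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include <mpi.h>

// Crude point-to-point bandwidth test between rank 0 and rank 1 using
// device buffers directly (requires a CUDA-aware MPI build).
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nbytes = 1 << 26;  // 64 MiB message
    const int iters = 100;
    char *d_buf;
    cudaMalloc(&d_buf, nbytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; ++i) {
        if (rank == 0)
            MPI_Send(d_buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("approx. bandwidth: %.2f GB/s\n",
               (double)nbytes * iters / (t1 - t0) / 1e9);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```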
MPI, the Message Passing Interface, is a standard API for communicating data via messages between distributed processes that is commonly used in HPC to build applications that can scale to multi-node computer clusters. MPI is also fully compatible with CUDA, which is designed for parallel computing on a single computer or node. There are many reasons for wanting to combine the two parallel…
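As a minimal sketch of what a CUDA-aware MPI permits (assuming an MPI library built with CUDA support), the example below exchanges GPU buffers between neighboring ranks in a ring without staging the data through host memory:

```cpp
#include <cuda_runtime.h>
#include <mpi.h>

// With a CUDA-aware MPI, device pointers can be passed directly to MPI
// calls; without it, the application must first copy data to host memory.
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;
    float *d_send, *d_recv;
    cudaMalloc(&d_send, n * sizeof(float));
    cudaMalloc(&d_recv, n * sizeof(float));

    // Exchange GPU buffers with ring neighbors; no cudaMemcpy to the host
    // is needed because the MPI library understands device pointers.
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;
    MPI_Sendrecv(d_send, n, MPI_FLOAT, right, 0,
                 d_recv, n, MPI_FLOAT, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}
```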