Quantitative developers need to run back-testing simulations to see how financial algorithms perform from a profit and loss (P&L) standpoint. Statistical techniques are important for visualizing the possible outcomes of these algorithms in terms of their possible P&L paths. GPUs can greatly reduce the amount of time needed to do this. In the broader picture, mathematical modeling of financial…
Generative physical AI models can understand and execute actions with fine or gross motor skills within the physical world. Understanding and navigating the 3D space of the physical world requires spatial intelligence. Achieving spatial intelligence in physical AI involves converting the real world into AI-ready virtual representations that the model can understand.
In the first part of the series, we presented an overview of the IVF-PQ algorithm and explained how it builds on top of the IVF-Flat algorithm, using the Product Quantization (PQ) technique to compress the index and support larger datasets. In this second part of the IVF-PQ post, we cover the practical aspects of tuning IVF-PQ performance. It's worth noting again that IVF-PQ uses a lossy…
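To make the compression idea concrete, here is a minimal product quantization sketch in NumPy. It is not the cuVS implementation; the helper names and parameters (4 subspaces, 256 centroids per subspace, a handful of k-means iterations) are illustrative only.

```python
import numpy as np

def train_pq(data, n_subspaces=4, n_centroids=256, n_iters=10, seed=0):
    """Train one small codebook per subspace with a few k-means iterations."""
    rng = np.random.default_rng(seed)
    n, dim = data.shape
    sub_dim = dim // n_subspaces
    codebooks = []
    for s in range(n_subspaces):
        sub = data[:, s * sub_dim:(s + 1) * sub_dim]
        centroids = sub[rng.choice(n, n_centroids, replace=False)]
        for _ in range(n_iters):
            # Assign each sub-vector to its nearest centroid, then recompute centroids.
            labels = ((sub[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(1)
            for c in range(n_centroids):
                if (labels == c).any():
                    centroids[c] = sub[labels == c].mean(0)
        codebooks.append(centroids)
    return codebooks

def encode_pq(data, codebooks):
    """Replace each sub-vector with the index of its nearest codebook entry (one byte)."""
    sub_dim = data.shape[1] // len(codebooks)
    codes = np.empty((data.shape[0], len(codebooks)), dtype=np.uint8)
    for s, centroids in enumerate(codebooks):
        sub = data[:, s * sub_dim:(s + 1) * sub_dim]
        codes[:, s] = ((sub[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(1)
    return codes

vectors = np.random.rand(1000, 64).astype(np.float32)
codes = encode_pq(vectors, train_pq(vectors))  # 256 bytes per vector -> 4 bytes
```

Each 64-dimensional float32 vector shrinks from 256 bytes to 4 one-byte codes; that kind of compression is what lets a PQ-based index hold much larger datasets in GPU memory, at the cost of the lossy approximation mentioned above.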
In this post, we continue the series on accelerating vector search using NVIDIA cuVS. Our previous post in the series introduced IVF-Flat, a fast algorithm for accelerating approximate nearest neighbors (ANN) search on GPUs. We discussed how using an inverted file index (IVF) provides an intuitive way to reduce the complexity of the nearest neighbor search by limiting it to only a small subset of…
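For intuition, the following rough NumPy sketch shows the inverted-file idea independent of cuVS: cluster the database with k-means, then answer a query by scanning only the vectors in the few closest clusters. The parameter names (n_lists, n_probes) are illustrative, not the library's API.

```python
import numpy as np

def build_ivf(data, n_lists=16, n_iters=10, seed=0):
    """Coarse k-means; each inverted list holds the IDs of the vectors assigned to it."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), n_lists, replace=False)]
    for _ in range(n_iters):
        labels = ((data[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(1)
        for c in range(n_lists):
            if (labels == c).any():
                centroids[c] = data[labels == c].mean(0)
    labels = ((data[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(1)
    return centroids, [np.where(labels == c)[0] for c in range(n_lists)]

def search_ivf(query, data, centroids, lists, n_probes=2, k=5):
    """Scan only the n_probes lists whose centroids are closest to the query."""
    probes = ((centroids - query) ** 2).sum(-1).argsort()[:n_probes]
    candidates = np.concatenate([lists[c] for c in probes])
    dists = ((data[candidates] - query) ** 2).sum(-1)
    return candidates[dists.argsort()[:k]]

data = np.random.rand(5000, 32).astype(np.float32)
centroids, lists = build_ivf(data)
print(search_ivf(data[0], data, centroids, lists))  # IDs of the 5 nearest candidates
```

Probing more lists raises recall at the cost of scanning more vectors, which is exactly the accuracy/speed trade-off these posts explore.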
K-means is a clustering algorithm, one of the simplest and most popular unsupervised machine learning (ML) algorithms for data scientists.
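As a quick, runnable illustration, here is k-means on synthetic 2D data with scikit-learn; GPU-accelerated libraries such as RAPIDS cuML expose a very similar estimator interface.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic blobs; k-means should recover the two groups.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.5, size=(100, 2)),
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.cluster_centers_)  # roughly (0, 0) and (5, 5)
print(km.labels_[:5])       # cluster assignments for the first five points
```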
Note: As of January 6, 2025, VILA is part of the new Cosmos Nemotron vision language models. Visual language models have evolved significantly in recent years. However, existing technology typically supports only a single image: such models cannot reason across multiple images, do not support in-context learning, and cannot understand videos. They are also not optimized for inference speed. We developed VILA…
Mixture of experts (MoE) large language model (LLM) architectures have recently emerged, both in proprietary LLMs such as GPT-4 and in community models with the open-source release of Mistral AI's Mixtral 8x7B. The strong relative performance of the Mixtral model has raised much interest and numerous questions about MoE and its use in LLM architectures. So, what is MoE and why is it important?
While part 1 focused on the usage of the new NVIDIA cuTENSOR 2.0 CUDA math library, this post introduces additional usage modes, specifically usage from Python and Julia, and demonstrates the performance of cuTENSOR 2.0 based on benchmarks in a number of application domains. For more information…
NVIDIA cuTENSOR is a CUDA math library that provides optimized implementations of tensor operations, where tensors are dense, multi-dimensional arrays or array slices. The release of cuTENSOR 2.0 represents a major update over its predecessor in both functionality and performance. This version reimagines its APIs to be more expressive, including advanced just-in-time compilation capabilities, all…
Meta, NetworkX, Fast.ai, and other industry leaders share how to gain new insights from your data with emerging tools.
Performing an exhaustive exact k-nearest neighbor (kNN) search, also known as brute-force search, is expensive, and it doesn't scale particularly well to larger datasets. During vector search, brute-force search requires the distance to be calculated between every query vector and database vector. For the frequently used Euclidean and cosine distances, the computation task becomes equivalent to a…
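One reason the problem maps so well to GPUs is that, for squared Euclidean distance, expanding ||q - x||^2 = ||q||^2 - 2 q·x + ||x||^2 leaves a dense matrix product between the query and database matrices as the dominant cost. A minimal NumPy sketch of that formulation (illustrative, not the cuVS code):

```python
import numpy as np

def brute_force_knn(queries, database, k=10):
    """Exact kNN under squared Euclidean distance; the heavy term is a single matrix product."""
    q_norm = (queries ** 2).sum(1, keepdims=True)           # (n_queries, 1)
    d_norm = (database ** 2).sum(1)                          # (n_database,)
    dists = q_norm - 2.0 * (queries @ database.T) + d_norm   # (n_queries, n_database)
    idx = np.argpartition(dists, k, axis=1)[:, :k]           # k smallest, unordered
    order = np.take_along_axis(dists, idx, axis=1).argsort(axis=1)
    return np.take_along_axis(idx, order, axis=1)

queries = np.random.rand(8, 128).astype(np.float32)
database = np.random.rand(10_000, 128).astype(np.float32)
print(brute_force_knn(queries, database).shape)  # (8, 10)
```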
In this post, we dive deeper into each of the GPU-accelerated indexes mentioned in part 1 and give a brief explanation of how the algorithms work, along with a summary of important parameters to fine-tune their behavior. We then go through a simple end-to-end example to demonstrate cuVS' Python APIs on a question-and-answer problem with a pretrained large language model and provide a…
In the current AI landscape, vector search is one of the hottest topics due to its applications in large language models (LLMs) and generative AI. Semantic vector search enables a broad range of important tasks like detecting fraudulent transactions, recommending products to users, using contextual information to augment full-text searches, and finding actors that pose potential security risks.
Read this tutorial on how to tap into GPUs by importing cuDF instead of pandas, with only a few code changes.
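A minimal sketch of the kind of change involved; the file name and column names below are placeholders, and cuDF mirrors the pandas API closely enough for many common operations that the rest of the code is unchanged.

```python
# import pandas as pd        # CPU path
import cudf as pd            # GPU path: pandas-style API for many common operations

# "transactions.csv", "amount", and "customer_id" are hypothetical example data.
df = pd.read_csv("transactions.csv")
summary = (
    df[df["amount"] > 0]
    .groupby("customer_id")["amount"]
    .sum()
    .sort_values(ascending=False)
)
print(summary.head())
```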
Modeling time series data can be challenging (and fascinating) due to its inherent complexity and unpredictability. Long-term trends in time series can change drastically due to certain events, for example. Recall the beginning of the global pandemic, when businesses such as airlines or brick-and-mortar shops saw a quick decline in the number of customers and sales. In contrast…
Heterogeneous computing architectures, those that incorporate a variety of processor types working in tandem, have proven extremely valuable in the continued scalability of computational workloads in AI, machine learning (ML), quantum physics, and general data science. Critical to this development has been the ability to abstract away the heterogeneous architecture and promote a framework that…
If you are looking to take your machine learning (ML) projects to new levels of speed and scalability, GPU-accelerated data analytics can help you deliver insights quickly with breakthrough performance. From faster computation to efficient model training, GPUs bring many benefits to everyday ML tasks. Update: The blog below describes how to use GPU-only RAPIDS cuDF…
Read about an innovative GPU solution that solves the limitations of using small, biased datasets with RAPIDS cuDF.
Single-cell sequencing has become one of the most prominent technologies used in biomedical research. Its ability to decipher changes in the transcriptome and epigenome at the cell level has enabled researchers to gain valuable new insights. As a result, single-cell experiments have grown in size and complexity by a factor of over 100, with experiments involving more than 1 million cells becoming…
Reconstructing a smooth surface from a point cloud is a fundamental step in creating digital twins of real-world objects and scenes. Algorithms for surface reconstruction appear in various applications, such as industrial simulation, video game development, architectural design, medical imaging, and robotics. Neural Kernel Surface Reconstruction (NKSR) is the new NVIDIA algorithm for…
In the high-frequency trading world, thousands of market participants interact daily. In fact, high-frequency trading accounts for more than half of US equity trading volume, according to the paper High-Frequency Trading Synchronizes Prices in Financial Markets. Market makers are the big players on the sell side who provide liquidity in the market. Speculators are on the buy side…
QHack is an educational conference and the world's largest quantum machine learning (QML) hackathon. This year at QHack 2023, 2,850 individuals from 105 different countries competed for 8 days to build the most innovative solutions for quantum computing applications using NVIDIA quantum technology. The event was organized by Xanadu, with NVIDIA sponsoring the QHack 2023 NVIDIA Challenge.
Linear regression is a powerful statistical tool used to model the relationship between a dependent variable and one or more independent variables (features). An important, and often forgotten, concept in regression analysis is that of interaction terms. In short, interaction terms enable you to examine whether the relationship between the target and the independent variable changes depending on…
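A small synthetic example of what an interaction term looks like in practice (the variable names and the true coefficient of -1.5 are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
price = rng.uniform(1, 10, n)
is_weekend = rng.integers(0, 2, n)
# The effect of price on sales is stronger on weekends: true interaction = -1.5.
sales = 50 - 3 * price - 1.5 * price * is_weekend + rng.normal(0, 1, n)

# Design matrices without and with the interaction column price * is_weekend.
X_plain = np.column_stack([np.ones(n), price, is_weekend])
X_inter = np.column_stack([np.ones(n), price, is_weekend, price * is_weekend])

beta_plain, *_ = np.linalg.lstsq(X_plain, sales, rcond=None)
beta_inter, *_ = np.linalg.lstsq(X_inter, sales, rcond=None)
print(beta_inter)  # the last coefficient recovers roughly -1.5; the plain model cannot
```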
As a data scientist, you know that evaluating machine learning model performance is a crucial aspect of your work. To do so effectively, you have a wide range of statistical metrics at your disposal, each with its own unique strengths and weaknesses. By developing a solid understanding of these metrics, you are not only better equipped to choose the best one for optimizing your model but also to explain your…
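As a reminder of how a few of the most common classification metrics relate to the confusion matrix, here is a small from-scratch sketch (binary labels, invented example data):

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 computed directly from the confusion matrix."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
print(classification_metrics(y_true, y_pred))
```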
Linear regression is one of the simplest machine learning models out there. It is often the starting point not only for learning about data science but also for building quick and simple minimum viable products (MVPs), which then serve as benchmarks for more complex algorithms. In general, linear regression fits a line (in two dimensions) or a hyperplane (in three or more dimensions) that…
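For reference, the fitted hyperplane can be obtained in closed form with ordinary least squares; a NumPy sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                # three features
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(0, 0.1, 200)

Xb = np.column_stack([np.ones(len(X)), X])   # prepend a column of ones for the intercept
# Normal equations: w = (X^T X)^(-1) X^T y, solved as a linear system for stability.
w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
print(w)  # approximately [4.0, 2.0, -1.0, 0.5]
```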
Inverse lithography technology (ILT) was first implemented and demonstrated in early 2003. It was created by Danping Peng while he worked as an engineer at Luminescent Technologies Inc., a startup company founded by professors Stanley Osher and Eli Yablonovitch from UCLA and entrepreneurs Dan Abrams and Jack Herrik. At that time, ILT was a revolutionary solution that showed far superior…
Data scientists deal with algorithms daily. However, the data science discipline as a whole has developed into a role that does not involve the implementation of sophisticated algorithms. Nonetheless, practitioners can still benefit from building an understanding and repertoire of algorithms. In this article, the sorting algorithm merge sort is introduced, explained, evaluated, and implemented.
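For readers who want the gist before diving into the article, this is the standard top-down formulation of merge sort (a generic sketch, not the article's exact listing):

```python
def merge_sort(values):
    """Split the list, sort each half recursively, then merge the two sorted halves."""
    if len(values) <= 1:
        return values
    mid = len(values) // 2
    return merge(merge_sort(values[:mid]), merge_sort(values[mid:]))

def merge(left, right):
    """Merge two already-sorted lists in linear time."""
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

print(merge_sort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```

The recursion gives the familiar O(n log n) behavior: log n levels of splitting, with a linear-time merge at each level.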
This article discusses how to prepare text through vectorization, hashing, tokenization, and other techniques to be compatible with machine learning (ML) and other numerical algorithms. I'll explain and demonstrate the process. Natural language processing (NLP) applies machine learning and other techniques to language. However, machine learning and other techniques typically work on…
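A toy sketch of two of the techniques mentioned, a shared-vocabulary bag of words and the hashing trick, using a deliberately simple tokenizer (illustrative only; production pipelines use stable hash functions and far richer tokenization):

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and extract word-like tokens; intentionally simplistic."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bag_of_words(texts):
    """Map each document to a vector of term counts over a shared vocabulary."""
    vocab = sorted({tok for t in texts for tok in tokenize(t)})
    return vocab, [[Counter(tokenize(t)).get(tok, 0) for tok in vocab] for t in texts]

def hashing_vector(text, n_buckets=16):
    """Hashing trick: fixed-size vector, no vocabulary needed (collisions possible).

    Note: Python's built-in hash is salted per process; real implementations use a
    stable hash such as MurmurHash so vectors are reproducible across runs.
    """
    vec = [0] * n_buckets
    for tok in tokenize(text):
        vec[hash(tok) % n_buckets] += 1
    return vec

docs = ["The cat sat on the mat.", "The dog sat on the log."]
vocab, vectors = bag_of_words(docs)
print(vocab)
print(vectors)
print(hashing_vector(docs[0]))
```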
Algorithms are commonplace in the world of data science and machine learning. Algorithms power social media applications, Google search results, banking systems, and plenty more. Therefore, it's paramount that data scientists and machine learning practitioners have an intuition for analyzing, designing, and implementing algorithms. Efficient algorithms have saved companies millions of dollars…
Facebook researchers developed a reinforcement learning model that can outmatch human competitors in heads-up, no-limit Texas hold'em and turn endgame hold'em poker. At the heart of the model is how software agents handle perfect-information games such as chess versus imperfect-information games like poker. Instead of just deciding on its next move, a reinforcement learning software…
Recently, the Allen Institute for Artificial Intelligence announced a breakthrough for a BERT-based model, passing a 12th-grade science test. The GPU-accelerated system, called Aristo, can read, learn, and reason about science, in this case emulating the decision making of students. For this milestone, Aristo answered more than 90 percent of the questions on an eighth-grade science exam correctly…
London-based startup Fabula AI has developed a deep learning-based system that can help identify fake news across online platforms. "Automatically detecting fake news poses challenges that defy existing approaches based on linguistic content analysis," the company stated in a blog post. "News is often highly nuanced and their interpretation requires the knowledge of political or social context…
In efficient parallel algorithms, threads cooperate and share data to perform collective computations. To share data, the threads must synchronize. The granularity of sharing varies from algorithm to algorithm, so thread synchronization should be flexible. Making synchronization an explicit part of the program ensures safety, maintainability, and modularity. CUDA 9 introduces Cooperative Groups…
There's a new computational workhorse in town. For decades, general matrix-matrix multiply, known as GEMM in Basic Linear Algebra Subroutines (BLAS) libraries, has been a standard benchmark for computational performance. GEMM is possibly the most optimized and widely used routine in scientific computing. Expert implementations are available for every architecture and quickly achieve the peak…
Google recently announced the release of version 1.0 of its TensorFlow deep learning framework at its inaugural TensorFlow Developer Summit. In just its first year, the popular framework has helped researchers make progress with everything from language translation to early detection of skin cancer and preventing blindness in diabetics. The first major version comes with some fantastic new…
Russian scientists from Lomonosov Moscow State University used an ordinary GPU-accelerated desktop computer to solve, in just 15 minutes, complex quantum mechanics equations that would typically take two to three days on a large CPU-only supercomputer. Senior researchers Vladimir Pomerantcev and Olga Rubtsova and professor Vladimir Kukulin used a GeForce GTX 670 with CUDA and the PGI CUDA Fortran…
Adam McLaughlin, a PhD student at Georgia Tech, shares how he is using NVIDIA Tesla GPUs for his research on Betweenness Centrality, a graph analytics algorithm that tracks the most important vertices within a network. This can be applied to a broad range of applications, such as finding the head of a crime ring or determining the best location for a store within a city. Using a cluster of GPUs for…
Daniel Ambrosi, artist and photographer, is using NVIDIA GPUs in the Amazon cloud and CUDA to create giant 2D-stitched HDR panoramas called "Dreamscapes." Ambrosi applies a modified version of Google's DeepDream neural net visualization code to his original panoramic landscape images to create truly one-of-a-kind pieces of art. For more information, visit http://www.danielambrosi.com/
Linear solvers are probably the most common tool in scientific computing applications. There are two basic classes of methods that can be used to solve an equation: direct and iterative. Direct methods are usually robust, but have additional computational complexity and memory capacity requirements. Unlike direct solvers, iterative solvers require minimal memory overhead and feature better…
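The contrast is easy to see on a small, well-conditioned system: a direct solve factors the matrix once, while an iterative method such as Jacobi only ever needs matrix-vector products. A NumPy sketch on a synthetic diagonally dominant matrix (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
A = rng.normal(size=(n, n)) + n * np.eye(n)  # strongly diagonally dominant, so Jacobi converges
b = rng.normal(size=n)

x_direct = np.linalg.solve(A, b)             # direct method (LU factorization under the hood)

# Jacobi iteration: x_{k+1} = D^{-1} (b - (A - D) x_k), using only mat-vec products.
D = np.diag(A)
x = np.zeros(n)
for _ in range(100):
    x = (b - (A @ x - D * x)) / D

print(np.max(np.abs(x - x_direct)))          # tiny difference: the iteration has converged
```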
Columbia University researchers have created a robotic system that detects wrinkles and then irons the piece of cloth autonomously. Their paper highlights that the ironing process is the final step needed in their "pipeline" of a robot picking up a wrinkled shirt, then laying it on the table and lastly, folding the shirt with robotic arms. A GeForce GTX 770 GPU was used for their "wrinkle analysis…
Leyuan Wang, a Ph.D. student in the UC Davis Department of Computer Science, presented one of only two "Distinguished Papers" of the 51 accepted at Euro-Par 2015. Euro-Par is a European conference devoted to all aspects of parallel and distributed processing held August 24-28 at Austria's Vienna University of Technology. Leyuan's paper Fast Parallel Suffix Array on the GPU, co-authored by her…
Some years ago I started work on my first CUDA implementation of the Multiparticle Collision Dynamics (MPC) algorithm, a particle-in-cell code used to simulate hydrodynamic interactions between solvents and solutes. As part of this algorithm, a number of particle parameters are summed to calculate certain cell parameters. This was in the days of the Tesla GPU architecture (such as GT200 GPUs…
Histograms are an important data representation with many applications in computer vision, data analytics, and medical imaging. A histogram is a graphical representation of the data distribution across predefined bins. The input data set and the number of bins can vary greatly depending on the domain, so let's focus on one of the most common use cases: an image histogram using 256 bins for each…
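As a baseline for what the GPU version computes, here is a 256-bin-per-channel image histogram in NumPy on a synthetic 8-bit RGB image (placeholder data):

```python
import numpy as np

def channel_histograms(image):
    """Per-channel 256-bin histograms for an 8-bit image of shape (H, W, C)."""
    return np.stack([
        np.bincount(image[..., c].ravel(), minlength=256)
        for c in range(image.shape[-1])
    ])

image = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)  # stand-in for a real image
hist = channel_histograms(image)
print(hist.shape)         # (3, 256)
print(hist.sum(axis=1))   # each channel sums to 480 * 640 pixels
```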
Note: This post has been updated (November 2017) for CUDA 9 and the latest GPUs. The NVCC compiler now performs warp aggregation for atomics automatically in many cases, so you can get higher performance with no extra effort. In fact, the code generated by the compiler is actually faster than the manually written warp aggregation code. This post is mainly intended for those who want to learn how…
Parallel reduction is a common building block for many parallel algorithms. A presentation from 2007 by Mark Harris provided a detailed strategy for implementing parallel reductions on GPUs, but this 6-year-old document bears updating. In this post I will show you some features of the Kepler GPU architecture which make reductions even faster: the shuffle (SHFL) instruction and fast device memory…
In part II of this series, we looked at hierarchical tree traversal as a means of quickly identifying pairs of potentially colliding 3D objects and we demonstrated how optimizing for low divergence can result in substantial performance gains on massively parallel processors. Having a fast traversal algorithm is not very useful, though, unless we also have a tree to go with it. In this part…
In the first part of this series, we looked at collision detection on the GPU and discussed two commonly used algorithms that find potentially colliding pairs in a set of 3D objects using their axis-aligned bounding boxes (AABBs). Each of the two algorithms has its weaknesses: sort and sweep suffers from high execution divergence, while uniform grid relies on too many simplifying assumptions that…
This series of posts aims to highlight some of the main differences between conventional programming and parallel programming on the algorithmic level, using broad-phase collision detection as an example. The first part will give some background, discuss two commonly used approaches, and introduce the concept of divergence. The second part will switch gears to hierarchical tree traversal in order…
Fresh from the NVIDIA Numeric Libraries Team, a white paper illustrating the use of the CUSPARSE and CUBLAS libraries to achieve a 2x speedup of incomplete-LU- and Cholesky-preconditioned iterative methods. The paper focuses on the Bi-Conjugate Gradient and stabilized Conjugate Gradient iterative methods that can be used to solve large sparse non-symmetric and symmetric positive definite linear…