In traditional clinical practice, treatment decisions are often based on general guidelines, past experience, and trial-and-error approaches. Today, with access to electronic medical records (EMRs) and genomic data, a new era of precision medicine is emerging: one where treatments are tailored to individual patients with unprecedented accuracy. Precision medicine is an innovative approach…
The world of big data analytics is constantly seeking ways to accelerate processing and reduce infrastructure costs. Apache Spark has become a leading platform for scale-out analytics, handling massive datasets for ETL, machine learning, and deep learning workloads. While Spark has traditionally run on CPUs, GPU acceleration offers a compelling promise: significant speedups for data processing…
Stacking generalization is a widely used technique among machine learning (ML) engineers, where multiple models are combined to boost overall predictive performance. Hyperparameter optimization (HPO), on the other hand, involves systematically searching for the best set of hyperparameters to maximize the performance of a given ML algorithm. A common challenge when using both stacking and HPO…
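As a rough illustration of combining the two, here is a minimal, hedged sketch with scikit-learn (synthetic data; the hyperparameter grid and ordering are illustrative, not taken from the original post): one base learner is tuned via grid search, then stacked with a second estimator under a logistic-regression meta-learner.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# HPO on a single base model (illustrative grid)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [100, 200], "max_depth": [None, 8]},
    cv=3,
)
search.fit(X_train, y_train)

# Stack the tuned model with a second estimator; the meta-learner
# combines their out-of-fold predictions.
stack = StackingClassifier(
    estimators=[("rf", search.best_estimator_),
                ("lr", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```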
Feature engineering remains one of the most effective ways to improve model accuracy when working with tabular data. Unlike domains such as NLP and computer vision, where neural networks can extract rich patterns from raw inputs, the best-performing tabular models, particularly gradient-boosted decision trees, still gain a significant advantage from well-crafted features. However…
As data sizes have grown in enterprises across industries, Apache Parquet has become a prominent format for storing data. Apache Parquet is a columnar storage format designed for efficient data processing at scale. By organizing data by columns rather than rows, Parquet enables high-performance querying and analysis, as it can read only the necessary columns for a query instead of scanning entire…
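For illustration, a minimal sketch of that column-pruning behavior with pandas (file and column names here are hypothetical; pandas delegates to a Parquet engine such as pyarrow):

```python
import pandas as pd

# Write a small three-column Parquet file.
df = pd.DataFrame({"user_id": [1, 2], "amount": [9.5, 3.2], "note": ["a", "b"]})
df.to_parquet("sales.parquet")

# Only the two requested columns are decoded from disk;
# the "note" column is never scanned.
subset = pd.read_parquet("sales.parquet", columns=["user_id", "amount"])
print(subset)
```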
JSON is a popular text-based data format that allows for interoperability between systems in web applications as well as in data management. The format dates to the early 2000s and grew out of the need for communication between web servers and browsers. Standard JSON consists of key-value pairs that can include nested objects. JSON has grown in usage for storing web…
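A small example of the structure described above: key-value pairs with a nested object, serialized and parsed with Python's standard json module.

```python
import json

record = {"user": {"id": 42, "name": "Ada"}, "active": True}

text = json.dumps(record)      # serialize to a JSON string
parsed = json.loads(text)      # parse it back into a dict
print(parsed["user"]["name"])  # -> Ada
```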
Time series forecasting is a powerful data science technique used to predict future values based on data points from the past. Open-source Python libraries like skforecast make it easy to run time series forecasts on your data. They allow you to "bring your own" regressor that is compatible with the scikit-learn API, giving you the flexibility to work seamlessly with the model of your choice.
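A hedged sketch of that bring-your-own-regressor pattern (the import path shown follows older skforecast releases; newer versions expose a similar recursive forecaster under skforecast.recursive, so check your installed version):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from skforecast.ForecasterAutoreg import ForecasterAutoreg

# Toy daily series; substitute your own data.
y = pd.Series(np.arange(100, dtype=float),
              index=pd.date_range("2024-01-01", periods=100, freq="D"))

# Any scikit-learn-compatible regressor can be plugged in here.
forecaster = ForecasterAutoreg(regressor=RandomForestRegressor(random_state=0),
                               lags=12)
forecaster.fit(y=y)
print(forecaster.predict(steps=5))
```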
In the rapidly evolving landscape of artificial intelligence, the quality of the data used for training models is paramount. High-quality data ensures that models are accurate, reliable, and capable of generalizing well across various applications. The recent NVIDIA webinar, Enhance Generative AI Model Accuracy with High-Quality Multimodal Data Processing, dove into the intricacies of data…
Modern classification workflows often require classifying individual records and data points into multiple categories instead of just assigning a single label. Open-source Python libraries like scikit-learn make it easier to build models for these multi-label problems. Several models have built-in support for multi-label datasets, and a simple scikit-learn utility function enables using those…
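A minimal multi-label sketch (the specific utility the post refers to is elided in this excerpt; MultiOutputClassifier is one standard scikit-learn tool for wrapping single-label estimators, and the dataset here is synthetic):

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Y is a binary indicator matrix: one column per possible label.
X, Y = make_multilabel_classification(n_samples=500, n_labels=3,
                                      random_state=0)

clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
print(clf.predict(X[:2]))  # one 0/1 vector of labels per record
```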
cudf.pandas, introduced in a previous post, is a GPU-accelerated library that accelerates pandas to deliver significant performance improvements (up to 50x faster) without requiring any changes to your existing code. As part of the NVIDIA RAPIDS ecosystem, it acts as a proxy layer that executes operations on the GPU when possible, and falls back to the CPU (via pandas) when necessary.
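The documented way to enable the proxy layer is to activate cudf.pandas before importing pandas; unmodified pandas code then runs on the GPU where supported. A minimal sketch:

```python
import cudf.pandas
cudf.pandas.install()  # or, in a notebook: %load_ext cudf.pandas

import pandas as pd    # now proxied by cudf.pandas

df = pd.DataFrame({"key": ["a", "b", "a"], "value": [1.0, 2.0, 3.0]})
# Executes on the GPU when supported; falls back to CPU pandas otherwise.
print(df.groupby("key")["value"].mean())
```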
As consumer applications generate more data than ever before, enterprises are turning to causal inference methods for observational data to help shed light on how changes to individual components of their app impact key business metrics. Over the last decade, econometricians have developed a technique called double machine learning that brings the power of machine learning models to causal…
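A hedged sketch of the core double machine learning idea for a partially linear model (synthetic data, scikit-learn only; dedicated libraries such as DoubleML or EconML wrap this procedure): residualize both the outcome and the treatment on the covariates with cross-fitted ML models, then regress residual on residual to estimate the treatment effect.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))                  # covariates
t = X[:, 0] + rng.normal(size=2000)             # treatment depends on X
y = 2.0 * t + X[:, 1] + rng.normal(size=2000)   # true effect = 2.0

# Cross-fitted nuisance predictions avoid overfitting bias.
y_hat = cross_val_predict(RandomForestRegressor(random_state=0), X, y, cv=5)
t_hat = cross_val_predict(RandomForestRegressor(random_state=0), X, t, cv=5)

theta = LinearRegression().fit((t - t_hat).reshape(-1, 1), y - y_hat)
print(theta.coef_[0])  # estimate of the causal effect, close to 2.0
```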
Open-source datasets have significantly democratized access to high-quality data, lowering the barrier to entry for developers and researchers training cutting-edge generative AI models. By providing free access to diverse, high-quality, and well-curated data, they enable the open-source community to train models at or close to the frontier, facilitating the rapid advancement…
NeMo Curator now supports images, enabling you to process data for training accurate generative AI models.
Today, Polars released a new GPU engine powered by RAPIDS cuDF that accelerates Polars workflows up to 13x on NVIDIA GPUs, allowing data scientists to process hundreds of millions of rows of data in seconds on a single machine. Traditional data processing libraries like pandas are single-threaded and become impractical to use beyond a few million rows of data.
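Opting into the GPU engine is done at collect time on a lazy query; Polars falls back to the CPU engine when an operation is unsupported. A minimal sketch (inline toy data):

```python
import polars as pl

lf = pl.LazyFrame({"store": ["a", "b", "a"], "amount": [10.0, 5.0, 7.5]})

# engine="gpu" routes execution through the cuDF-powered GPU engine.
result = (
    lf.group_by("store")
      .agg(pl.col("amount").sum())
      .collect(engine="gpu")
)
print(result)
```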
The International Society of Automation (ISA) reports that 5% of plant production is lost annually due to downtime. Put differently, manufacturers across all industry segments surrender roughly $647B globally, the corresponding share of nearly $13T in production. The challenge at hand is predicting the maintenance needs of these machines to minimize…
RAPIDS 24.08 is now available with significant updates geared towards processing larger workloads and seamless CPU/GPU interoperability.
NVIDIA has released RAPIDS cuDF unified memory and text data processing features that help data scientists continue to use pandas when working with larger and text-heavy datasets in demanding workloads. Data scientists can now accelerate these workloads by up to 30x. RAPIDS is a collection of open-source GPU-accelerated data science and AI libraries. cuDF is a Python GPU DataFrame library for…
What is the interest in trillion-parameter models? We know many of the use cases today, and interest is growing due to the promise of an increased capacity for: The benefits are great, but training and deploying large models can be computationally expensive and resource-intensive. Computationally efficient, cost-effective, and energy-efficient systems, architected to deliver real-time…
Nested data types are a convenient way to represent hierarchical relationships within columnar data. They are frequently used as part of extract, transform, load (ETL) workloads in business intelligence, recommender systems, cybersecurity, geospatial, and other applications. List types can be used to easily attach multiple transactions to a user without creating a new lookup table…
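A small illustration of that list-type pattern: each user row carries a variable-length list of transaction amounts, with no separate lookup table. (Shown here with pyarrow for portability; cuDF exposes equivalent list dtypes on the GPU. Data is hypothetical.)

```python
import pyarrow as pa

table = pa.table({
    "user": ["alice", "bob"],
    # One list of transaction amounts per user row.
    "transactions": pa.array([[9.99, 3.50], [42.00]],
                             type=pa.list_(pa.float64())),
})
print(table)
```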
See how KDnuggets achieved a 500x speedup using CuPy and NVIDIA CUDA on 3D arrays.
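A minimal CuPy sketch of the pattern involved: allocate a 3D array on the GPU and apply elementwise math there (the 500x figure comes from the cited benchmark, not this toy example).

```python
import cupy as cp

volume = cp.random.random((256, 256, 256))  # 3D array resident on the GPU
result = cp.sqrt(volume) + cp.sin(volume)   # elementwise GPU kernels
print(float(result.mean()))                 # copy a scalar back to host
```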
Meta, NetworkX, Fast.ai, and other industry leaders share how to gain new insights from your data with emerging tools.
Extract-transform-load (ETL) operations with GPUs using the NVIDIA RAPIDS Accelerator for Apache Spark running on large-scale data can produce both cost savings and performance gains. We demonstrated this in our previous post, GPUs for ETL? Run Faster, Less Costly Workloads with NVIDIA RAPIDS Accelerator for Apache Spark and Databricks. In this post, we dive deeper to identify precisely which…
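For context, a hedged sketch of how the RAPIDS Accelerator is typically enabled in a PySpark session (the plugin class name is the documented one; jar deployment and resource settings depend on your cluster, and Databricks uses its own cluster-level configuration):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gpu-etl")
    # Documented plugin entry point for the RAPIDS Accelerator.
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    .getOrCreate()
)
```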
Gathering business insights can be a pain, especially when you're dealing with countless data points. It's no secret that GPUs can be a time-saver for data scientists. Rather than wait for a single query to run, GPUs help speed up the process and get you the insights you need quickly. In this video, Allan Enemark, RAPIDS data visualization lead, uses a US Census dataset with over 300…
We were stuck. Really stuck. With a hard delivery deadline looming, our team needed to figure out how to process a complex extract-transform-load (ETL) job on trillions of point-of-sale transaction records in a few hours. The results of this job would feed a series of downstream machine learning (ML) models that would make critical retail assortment allocation decisions for a global retailer.
Digital pathology slide scanners generate massive images. Glass slides are routinely scanned at 40x magnification, resulting in gigapixel images. Compression can reduce the file size to 1 or 2 GB per slide, but this volume of data is still challenging to move around, save, load, and view. To view a typical whole slide image at full resolution would require a monitor about the size of a tennis…
Visualization brings data to life, unveiling hidden patterns and insights through accessible visuals, and empowering you and your organization to perceive the invisible, make informed decisions, and fully leverage your data. Especially when working with large datasets, interaction can be difficult as render and compute times become prohibitive. Switching to RAPIDS libraries, such as cuDF…
If you are looking to take your machine learning (ML) projects to new levels of speed and scalability, GPU-accelerated data analytics can help you deliver insights quickly with breakthrough performance. From faster computation to efficient model training, GPUs bring many benefits to everyday ML tasks. Update: The below blog describes how to use GPU-only RAPIDS cuDF…
AI models are everywhere, in the form of chatbots, classification and summarization tools, image models for segmentation and detection, recommendation models, and more. AI machine learning (ML) models help automate many business processes, generate insights from data, and deliver new experiences. Python is one of the most popular languages used in AI/ML development. In this post…
Apache Spark is an industry-leading platform for distributed extract, transform, and load (ETL) workloads on large-scale data. However, with the advent of deep learning (DL), many Spark practitioners have sought to add DL models to their data processing pipelines across a variety of use cases like sales predictions, content recommendations, sentiment analysis, and fraud detection. Yet…
A retailer's supply chain includes sourcing raw materials or finished goods from suppliers; storing them in warehouses or distribution centers; transporting them to stores or customers; and managing sales. Retailers also collect, store, and analyze data to optimize supply chain performance, with teams responsible for managing each stage of the supply chain…
This post is part of a series on accelerated data analytics. Digital advancements in climate modeling, healthcare, finance, and retail are generating unprecedented volumes and types of data. IDC says that by 2025, there will be 180 ZB of data compared to 64 ZB in 2020, scaling up the need for data analytics to turn all that data into insights. NVIDIA provides the RAPIDS suite of…
This post is part of a series on accelerated data analytics. Update: The below blog describes how to use GPU-only RAPIDS cuDF, which requires code changes. RAPIDS cuDF now has a CPU/GPU interoperability (cudf.pandas) that speeds up pandas code by up to 150x with zero code changes. At GTC 2024, NVIDIA announced that the cudf.pandas library is now GA. At Google I/O…
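For contrast with the zero-code-change cudf.pandas path, a minimal sketch of the GPU-only cuDF usage this post describes: import cudf directly and use its pandas-like API (toy inline data).

```python
import cudf

df = cudf.DataFrame({"key": ["a", "b", "a"], "value": [1.0, 2.0, 3.0]})
# Same DataFrame idioms as pandas, executed on the GPU.
print(df.groupby("key")["value"].mean())
```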
Data is one of the most valuable assets that a business can possess. It sits at the core of data science and data analysis: without data, they're both obsolete. Businesses that actively collect data may have a competitive advantage over those that do not. With sufficient data, organizations can better determine the cause of problems and make informed decisions. There are scenarios where an…
It is well-known that GPUs are the typical go-to solution for large machine learning (ML) applications, but what if GPUs were applied to earlier stages of the data-to-AI pipeline? For example, it would be simpler if you did not have to switch out cluster configurations for each pipeline processing stage. You might still have some questions: At AT&T, these questions arose when our…
In the machine learning and MLOps world, GPUs are widely used to speed up model training and inference, but what about the other stages of the workflow like ETL pipelines or hyperparameter optimization? Within the RAPIDS data science framework, ETL tools are designed to have a familiar look and feel to data scientists working in Python. Do you currently use Pandas, NumPy, Scikit-learn…
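As one hedged example of that familiar look and feel, cuML mirrors the scikit-learn estimator API while executing on the GPU (toy inline data; cuML and cuDF must be installed via RAPIDS):

```python
import cudf
from cuml.cluster import KMeans

df = cudf.DataFrame({"x": [0.0, 0.1, 5.0, 5.1],
                     "y": [0.0, 0.2, 5.0, 4.9]})

# Same fit/predict idioms as scikit-learn, run on the GPU.
model = KMeans(n_clusters=2, random_state=0).fit(df)
print(model.labels_)
```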
Data is the lifeblood of modern enterprises, whether you're a retailer, financial service company, or digital advertiser. Across industries, organizations are recognizing the importance of their data for business analytics, machine learning, and AI. Smart businesses are investing in new ways to extract value from their data: to better understand customer needs and behaviors…
Data processing is increasingly making use of NVIDIA computing for massive parallelism. Advancements in accelerated compute mean that access to storage must also be quicker, whether in analytics, artificial intelligence (AI), or machine learning (ML) pipelines. The benefits from GPU acceleration are limited if data access dominates the execution time. GPU-based processing drives a…
Recently, NVIDIA CEO Jensen Huang announced updates to the open beta of NVIDIA Merlin, an end-to-end framework that democratizes the development of large-scale deep learning recommenders. With NVIDIA Merlin, data scientists, machine learning engineers, and researchers can accelerate their entire workflow pipeline from ingesting and training to deploying GPU-accelerated recommenders (Figure 1).
Recommender systems are ubiquitous in online platforms, helping users navigate through an exponentially growing number of goods and services. These models are key in driving user engagement. With the rapid growth in scale of industry datasets, deep learning (DL) recommender models have started to gain advantages over traditional methods by capitalizing on large amounts of training data.