In traditional clinical practice, treatment decisions are often based on general guidelines, past experience, and trial-and-error approaches. Today, with access to electronic medical records (EMRs) and genomic data, a new era of precision medicine is emerging: one where treatments are tailored to individual patients with unprecedented accuracy. Precision medicine is an innovative approach…
The world of big data analytics is constantly seeking ways to accelerate processing and reduce infrastructure costs. Apache Spark has become a leading platform for scale-out analytics, handling massive datasets for ETL, machine learning, and deep learning workloads. While Spark has traditionally run on CPUs, GPU acceleration offers a compelling promise: significant speedups for data processing…
Stacking generalization is a widely used technique among machine learning (ML) engineers, where multiple models are combined to boost overall predictive performance. Hyperparameter optimization (HPO), on the other hand, involves systematically searching for the best set of hyperparameters to maximize the performance of a given ML algorithm. A common challenge when using both stacking and HPO…
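As a rough illustration of combining the two, here is a minimal, hedged sketch with scikit-learn (synthetic data; the hyperparameter grid and ordering are illustrative, not taken from the original post): one base learner is tuned via grid search, then stacked with a second estimator under a logistic-regression meta-learner.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# HPO on a single base model (illustrative grid)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [100, 200], "max_depth": [None, 8]},
    cv=3,
)
search.fit(X_train, y_train)

# Stack the tuned model with a second estimator; the meta-learner
# combines their out-of-fold predictions.
stack = StackingClassifier(
    estimators=[("rf", search.best_estimator_),
                ("lr", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```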
Feature engineering remains one of the most effective ways to improve model accuracy when working with tabular data. Unlike domains such as NLP and computer vision, where neural networks can extract rich patterns from raw inputs, the best-performing tabular models, particularly gradient-boosted decision trees, still gain a significant advantage from well-crafted features. However…
As data sizes have grown in enterprises across industries, Apache Parquet has become a prominent format for storing data. Apache Parquet is a columnar storage format designed for efficient data processing at scale. By organizing data by columns rather than rows, Parquet enables high-performance querying and analysis, as it can read only the necessary columns for a query instead of scanning entire…
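For illustration, a minimal sketch of that column-pruning behavior with pandas (file and column names here are hypothetical; pandas delegates to a Parquet engine such as pyarrow):

```python
import pandas as pd

# Write a small three-column Parquet file.
df = pd.DataFrame({"user_id": [1, 2], "amount": [9.5, 3.2], "note": ["a", "b"]})
df.to_parquet("sales.parquet")

# Only the two requested columns are decoded from disk;
# the "note" column is never scanned.
subset = pd.read_parquet("sales.parquet", columns=["user_id", "amount"])
print(subset)
```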
JSON is a popular text-based data format that allows for interoperability between systems in web applications as well as in data management. The format dates to the early 2000s and grew out of the need for communication between web servers and browsers. Standard JSON consists of key-value pairs that can include nested objects. JSON has grown in usage for storing web…
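A small example of the structure described above: key-value pairs with a nested object, serialized and parsed with Python's standard json module.

```python
import json

record = {"user": {"id": 42, "name": "Ada"}, "active": True}

text = json.dumps(record)      # serialize to a JSON string
parsed = json.loads(text)      # parse it back into a dict
print(parsed["user"]["name"])  # -> Ada
```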
Time series forecasting is a powerful data science technique used to predict future values based on data points from the past. Open-source Python libraries like skforecast make it easy to run time series forecasts on your data. They allow you to "bring your own" regressor that is compatible with the scikit-learn API, giving you the flexibility to work seamlessly with the model of your choice.
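A hedged sketch of that bring-your-own-regressor pattern (the import path shown follows older skforecast releases; newer versions expose a similar recursive forecaster under skforecast.recursive, so check your installed version):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from skforecast.ForecasterAutoreg import ForecasterAutoreg

# Toy daily series; substitute your own data.
y = pd.Series(np.arange(100, dtype=float),
              index=pd.date_range("2024-01-01", periods=100, freq="D"))

# Any scikit-learn-compatible regressor can be plugged in here.
forecaster = ForecasterAutoreg(regressor=RandomForestRegressor(random_state=0),
                               lags=12)
forecaster.fit(y=y)
print(forecaster.predict(steps=5))
```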
In the rapidly evolving landscape of artificial intelligence, the quality of the data used for training models is paramount. High-quality data ensures that models are accurate, reliable, and capable of generalizing well across various applications. The recent NVIDIA webinar, Enhance Generative AI Model Accuracy with High-Quality Multimodal Data Processing, dove into the intricacies of data…
Modern classification workflows often require classifying individual records and data points into multiple categories instead of just assigning a single label. Open-source Python libraries like scikit-learn make it easier to build models for these multi-label problems. Several models have built-in support for multi-label datasets, and a simple scikit-learn utility function enables using those…
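A minimal multi-label sketch (the specific utility the post refers to is elided in this excerpt; MultiOutputClassifier is one standard scikit-learn tool for wrapping single-label estimators, and the dataset here is synthetic):

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Y is a binary indicator matrix: one column per possible label.
X, Y = make_multilabel_classification(n_samples=500, n_labels=3,
                                      random_state=0)

clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
print(clf.predict(X[:2]))  # one 0/1 vector of labels per record
```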
cudf.pandas, introduced in a previous post, is a GPU-accelerated library that accelerates pandas to deliver significant performance improvements (up to 50x faster) without requiring any changes to your existing code. As part of the NVIDIA RAPIDS ecosystem, it acts as a proxy layer that executes operations on the GPU when possible, and falls back to the CPU (via pandas) when necessary.
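The documented way to enable the proxy layer is to activate cudf.pandas before importing pandas; unmodified pandas code then runs on the GPU where supported. A minimal sketch:

```python
import cudf.pandas
cudf.pandas.install()  # or, in a notebook: %load_ext cudf.pandas

import pandas as pd    # now proxied by cudf.pandas

df = pd.DataFrame({"key": ["a", "b", "a"], "value": [1.0, 2.0, 3.0]})
# Executes on the GPU when supported; falls back to CPU pandas otherwise.
print(df.groupby("key")["value"].mean())
```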
As consumer applications generate more data than ever before, enterprises are turning to causal inference methods for observational data to help shed light on how changes to individual components of their app impact key business metrics. Over the last decade, econometricians have developed a technique called double machine learning that brings the power of machine learning models to causal…
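A hedged sketch of the core double machine learning idea for a partially linear model (synthetic data, scikit-learn only; dedicated libraries such as DoubleML or EconML wrap this procedure): residualize both the outcome and the treatment on the covariates with cross-fitted ML models, then regress residual on residual to estimate the treatment effect.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))                  # covariates
t = X[:, 0] + rng.normal(size=2000)             # treatment depends on X
y = 2.0 * t + X[:, 1] + rng.normal(size=2000)   # true effect = 2.0

# Cross-fitted nuisance predictions avoid overfitting bias.
y_hat = cross_val_predict(RandomForestRegressor(random_state=0), X, y, cv=5)
t_hat = cross_val_predict(RandomForestRegressor(random_state=0), X, t, cv=5)

theta = LinearRegression().fit((t - t_hat).reshape(-1, 1), y - y_hat)
print(theta.coef_[0])  # estimate of the causal effect, close to 2.0
```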
Open-source datasets have significantly democratized access to high-quality data, lowering the barrier to entry for developers and researchers training cutting-edge generative AI models. By providing free access to diverse, high-quality, and well-curated data, they enable the open-source community to train models at or close to the frontier, facilitating the rapid advancement…
NeMo Curator now supports images, enabling you to process data for training accurate generative AI models.
Today, Polars released a new GPU engine powered by RAPIDS cuDF that accelerates Polars workflows up to 13x on NVIDIA GPUs, allowing data scientists to process hundreds of millions of rows of data in seconds on a single machine. Traditional data processing libraries like pandas are single-threaded and become impractical to use beyond a few million rows of data.
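Opting into the GPU engine is done at collect time on a lazy query; Polars falls back to the CPU engine when an operation is unsupported. A minimal sketch (inline toy data):

```python
import polars as pl

lf = pl.LazyFrame({"store": ["a", "b", "a"], "amount": [10.0, 5.0, 7.5]})

# engine="gpu" routes execution through the cuDF-powered GPU engine.
result = (
    lf.group_by("store")
      .agg(pl.col("amount").sum())
      .collect(engine="gpu")
)
print(result)
```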
The International Society of Automation (ISA) reports that 5% of plant production is lost annually due to downtime. Put differently, manufacturers across all industry segments surrender roughly $647B globally, the corresponding share of nearly $13T in production. The challenge at hand is predicting the maintenance needs of these machines to minimize…
RAPIDS 24.08 is now available with significant updates geared towards processing larger workloads and seamless CPU/GPU interoperability.
NVIDIA has released RAPIDS cuDF unified memory and text data processing features that help data scientists continue to use pandas when working with larger and text-heavy datasets in demanding workloads. Data scientists can now accelerate these workloads by up to 30x. RAPIDS is a collection of open-source GPU-accelerated data science and AI libraries. cuDF is a Python GPU DataFrame library for…
What is the interest in trillion-parameter models? We know many of the use cases today, and interest is growing due to the promise of an increased capacity for: The benefits are great, but training and deploying large models can be computationally expensive and resource-intensive. Computationally efficient, cost-effective, and energy-efficient systems, architected to deliver real-time…
Nested data types are a convenient way to represent hierarchical relationships within columnar data. They are frequently used as part of extract, transform, load (ETL) workloads in business intelligence, recommender systems, cybersecurity, geospatial, and other applications. List types can be used to easily attach multiple transactions to a user without creating a new lookup table…
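A small illustration of that list-type pattern: each user row carries a variable-length list of transaction amounts, with no separate lookup table. (Shown here with pyarrow for portability; cuDF exposes equivalent list dtypes on the GPU. Data is hypothetical.)

```python
import pyarrow as pa

table = pa.table({
    "user": ["alice", "bob"],
    # One list of transaction amounts per user row.
    "transactions": pa.array([[9.99, 3.50], [42.00]],
                             type=pa.list_(pa.float64())),
})
print(table)
```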
See how KDnuggets achieved a 500x speedup using CuPy and NVIDIA CUDA on 3D arrays.
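A minimal CuPy sketch of the pattern involved: allocate a 3D array on the GPU and apply elementwise math there (the 500x figure comes from the cited benchmark, not this toy example).

```python
import cupy as cp

volume = cp.random.random((256, 256, 256))  # 3D array resident on the GPU
result = cp.sqrt(volume) + cp.sin(volume)   # elementwise GPU kernels
print(float(result.mean()))                 # copy a scalar back to host
```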
Meta, NetworkX, Fast.ai, and other industry leaders share how to gain new insights from your data with emerging tools.
Extract-transform-load (ETL) operations with GPUs using the NVIDIA RAPIDS Accelerator for Apache Spark running on large-scale data can produce both cost savings and performance gains. We demonstrated this in our previous post, GPUs for ETL? Run Faster, Less Costly Workloads with NVIDIA RAPIDS Accelerator for Apache Spark and Databricks. In this post, we dive deeper to identify precisely which…
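For context, a hedged sketch of how the RAPIDS Accelerator is typically enabled in a PySpark session (the plugin class name is the documented one; jar deployment and resource settings depend on your cluster, and Databricks uses its own cluster-level configuration):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gpu-etl")
    # Documented plugin entry point for the RAPIDS Accelerator.
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    .getOrCreate()
)
```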
Gathering business insights can be a pain, especially when you're dealing with countless data points. It's no secret that GPUs can be a time-saver for data scientists. Rather than wait for a single query to run, GPUs help speed up the process and get you the insights you need quickly. In this video, Allan Enemark, RAPIDS data visualization lead, uses a US Census dataset with over 300…
We were stuck. Really stuck. With a hard delivery deadline looming, our team needed to figure out how to process a complex extract-transform-load (ETL) job on trillions of point-of-sale transaction records in a few hours. The results of this job would feed a series of downstream machine learning (ML) models that would make critical retail assortment allocation decisions for a global retailer.
Digital pathology slide scanners generate massive images. Glass slides are routinely scanned at 40x magnification, resulting in gigapixel images. Compression can reduce the file size to 1 or 2 GB per slide, but this volume of data is still challenging to move around, save, load, and view. To view a typical whole slide image at full resolution would require a monitor about the size of a tennis…
Visualization brings data to life, unveiling hidden patterns and insights through accessible visuals, and empowering you and your organization to perceive the invisible, make informed decisions, and fully leverage your data. Especially when working with large datasets, interaction can be difficult as render and compute times become prohibitive. Switching to RAPIDS libraries, such as cuDF…
If you are looking to take your machine learning (ML) projects to new levels of speed and scalability, GPU-accelerated data analytics can help you deliver insights quickly with breakthrough performance. From faster computation to efficient model training, GPUs bring many benefits to everyday ML tasks. Update: The below blog describes how to use GPU-only RAPIDS cuDF…
AI models are everywhere, in the form of chatbots, classification and summarization tools, image models for segmentation and detection, recommendation models, and more. AI machine learning (ML) models help automate many business processes, generate insights from data, and deliver new experiences. Python is one of the most popular languages used in AI/ML development. In this post…
Apache Spark is an industry-leading platform for distributed extract, transform, and load (ETL) workloads on large-scale data. However, with the advent of deep learning (DL), many Spark practitioners have sought to add DL models to their data processing pipelines across a variety of use cases like sales predictions, content recommendations, sentiment analysis, and fraud detection. Yet…
A retailer's supply chain includes sourcing raw materials or finished goods from suppliers; storing them in warehouses or distribution centers; transporting them to stores or customers; and managing sales. Retailers also collect, store, and analyze data to optimize supply chain performance, with teams responsible for managing each stage of the supply chain…
This post is part of a series on accelerated data analytics. Digital advancements in climate modeling, healthcare, finance, and retail are generating unprecedented volumes and types of data. IDC says that by 2025, there will be 180 ZB of data compared to 64 ZB in 2020, scaling up the need for data analytics to turn all that data into insights. NVIDIA provides the RAPIDS suite of…
This post is part of a series on accelerated data analytics. Update: The below blog describes how to use GPU-only RAPIDS cuDF, which requires code changes. RAPIDS cuDF now has a CPU/GPU interoperability (cudf.pandas) that speeds up pandas code by up to 150x with zero code changes. At GTC 2024, NVIDIA announced that the cudf.pandas library is now GA. At Google I/O…
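For contrast with the zero-code-change cudf.pandas path, a minimal sketch of the GPU-only cuDF usage this post describes: import cudf directly and use its pandas-like API (toy inline data).

```python
import cudf

df = cudf.DataFrame({"key": ["a", "b", "a"], "value": [1.0, 2.0, 3.0]})
# Same DataFrame idioms as pandas, executed on the GPU.
print(df.groupby("key")["value"].mean())
```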
Data is one of the most valuable assets that a business can possess. It sits at the core of data science and data analysis: without data, they're both obsolete. Businesses that actively collect data may have a competitive advantage over those that do not. With sufficient data, organizations can better determine the cause of problems and make informed decisions. There are scenarios where an…
It is well-known that GPUs are the typical go-to solution for large machine learning (ML) applications, but what if GPUs were applied to earlier stages of the data-to-AI pipeline? For example, it would be simpler if you did not have to switch out cluster configurations for each pipeline processing stage. You might still have some questions: At AT&T, these questions arose when our…
In the machine learning and MLOps world, GPUs are widely used to speed up model training and inference, but what about the other stages of the workflow like ETL pipelines or hyperparameter optimization? Within the RAPIDS data science framework, ETL tools are designed to have a familiar look and feel to data scientists working in Python. Do you currently use Pandas, NumPy, Scikit-learn…
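As one hedged example of that familiar look and feel, cuML mirrors the scikit-learn estimator API while executing on the GPU (toy inline data; cuML and cuDF must be installed via RAPIDS):

```python
import cudf
from cuml.cluster import KMeans

df = cudf.DataFrame({"x": [0.0, 0.1, 5.0, 5.1],
                     "y": [0.0, 0.2, 5.0, 4.9]})

# Same fit/predict idioms as scikit-learn, run on the GPU.
model = KMeans(n_clusters=2, random_state=0).fit(df)
print(model.labels_)
```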
Data is the lifeblood of modern enterprises, whether you're a retailer, financial service company, or digital advertiser. Across industries, organizations are recognizing the importance of their data for business analytics, machine learning, and AI. Smart businesses are investing in new ways to extract value from their data: to better understand customer needs and behaviors…
Data processing is increasingly making use of NVIDIA computing for massive parallelism. Advancements in accelerated compute mean that access to storage must also be quicker, whether in analytics, artificial intelligence (AI), or machine learning (ML) pipelines. The benefits from GPU acceleration are limited if data access dominates the execution time. GPU-based processing drives a…
Recently, NVIDIA CEO Jensen Huang announced updates to the open beta of NVIDIA Merlin, an end-to-end framework that democratizes the development of large-scale deep learning recommenders. With NVIDIA Merlin, data scientists, machine learning engineers, and researchers can accelerate their entire workflow pipeline from ingesting and training to deploying GPU-accelerated recommenders (Figure 1).
Recommender systems are ubiquitous in online platforms, helping users navigate through an exponentially growing number of goods and services. These models are key in driving user engagement. With the rapid growth in scale of industry datasets, deep learning (DL) recommender models have started to gain advantages over traditional methods by capitalizing on large amounts of training data.