Gregory Kimball – NVIDIA Technical Blog

News and tutorials for developers, data scientists, and IT admins 2025-04-23T02:44:00Z http://www.open-lab.net/blog/feed/

Gregory Kimball <![CDATA[Efficient ETL with Polars and Apache Spark on NVIDIA Grace CPU]]> http://www.open-lab.net/blog/?p=96807 2025-04-23T00:33:58Z 2025-03-11T18:30:00Z

The NVIDIA Grace CPU Superchip delivers outstanding performance and best-in-class energy efficiency for CPU workloads in the data center and in the cloud. The benefits of NVIDIA Grace include high-performance Arm Neoverse V2 cores, fast NVIDIA-designed Scalable Coherency Fabric, and low-power high-bandwidth LPDDR5X memory. These features make the Grace CPU ideal for data processing with…

Gregory Kimball <![CDATA[JSON Lines Reading with pandas 100x Faster Using NVIDIA cuDF]]> http://www.open-lab.net/blog/?p=95970 2025-04-23T02:44:00Z 2025-02-20T17:00:00Z

JSON is a widely adopted text-based format for exchanging information between systems, most commonly in web applications and large language models (LLMs). While the JSON format is human-readable, it is complex to process with data science and data engineering tools. JSON data often takes the form of newline-delimited JSON Lines (also known as NDJSON) to represent multiple records…
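To make the JSON Lines layout mentioned above concrete, here is a minimal sketch that uses only the Python standard library. It is not the pandas or cuDF reader discussed in the post. The helper name `read_json_lines` and the sample records are hypothetical and exist only for illustration.

```python
import io
import json


def read_json_lines(stream):
    """Parse newline-delimited JSON: one complete JSON value per non-blank line."""
    records = []
    for line in stream:
        line = line.strip()
        if line:  # tolerate blank lines between records
            records.append(json.loads(line))
    return records


# Hypothetical sample data: three records, one JSON object per line.
sample = io.StringIO(
    '{"id": 1, "tags": ["a", "b"]}\n'
    '{"id": 2, "tags": []}\n'
    '{"id": 3, "tags": ["c"]}\n'
)

records = read_json_lines(sample)
print(len(records), records[0]["tags"])
```

Because each line is a self-contained JSON value, a reader can process records one at a time or split a file at newline boundaries, which is part of why the format is common for multi-record data.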

Gregory Kimball <![CDATA[Supercharging Deduplication in pandas Using RAPIDS cuDF]]> http://www.open-lab.net/blog/?p=92703 2024-12-12T19:38:34Z 2024-11-28T14:00:00Z

A common operation in data analytics is to drop duplicate rows. Deduplication is critical in Extract, Transform, Load (ETL) workflows, where you might want to...

Gregory Kimball <![CDATA[Scaling Up to One Billion Rows of Data in pandas using RAPIDS cuDF]]> http://www.open-lab.net/blog/?p=88761 2024-09-25T17:26:00Z 2024-09-11T16:54:53Z

The One Billion Row Challenge is a fun benchmark to showcase basic data processing operations. It was originally launched as a pure-Java competition, and has gathered a community of developers in other languages, including Python, Rust, Go, Swift, and more. The challenge has been useful for many software engineers with an interest in exploring the details of text file reading…

Gregory Kimball <![CDATA[Encoding and Compression Guide for Parquet String Data Using RAPIDS]]> http://www.open-lab.net/blog/?p=85090 2024-08-08T18:48:49Z 2024-07-17T16:00:00Z

Parquet writers provide encoding and compression options that are turned off by default. Enabling these options may provide better lossless compression for your data, but understanding which options to use for your specific use case is critical to making sure they perform as intended. In this post, we explore which encoding and compression options work best for your string data.

Gregory Kimball <![CDATA[Streamline ETL Workflows with Nested Data Types in RAPIDS libcudf]]> http://www.open-lab.net/blog/?p=75553 2024-01-22T21:35:40Z 2023-12-15T21:16:55Z

Nested data types are a convenient way to represent hierarchical relationships within columnar data. They are frequently used as part of extract, transform, load (ETL) workloads in business intelligence, recommender systems, cybersecurity, geospatial, and other applications. List types can be used to easily attach multiple transactions to a user without creating a new lookup table…

Gregory Kimball <![CDATA[GPU-Accelerated JSON Data Processing with RAPIDS]]> http://www.open-lab.net/blog/?p=60657 2023-11-20T23:12:50Z 2023-02-09T17:30:00Z

JSON is a widely adopted text-based format for exchanging information between systems, most commonly in web applications. While the JSON format is human-readable, it is complex to process with data science and data engineering tools. To bridge that gap, RAPIDS cuDF provides a GPU-accelerated JSON reader (cudf.read_json) that is efficient and robust for many JSON data structures.

Gregory Kimball <![CDATA[Mastering String Transformations in RAPIDS libcudf]]> http://www.open-lab.net/blog/?p=56138 2023-06-12T08:45:27Z 2022-10-17T14:00:00Z

Efficient processing of string data is vital for many data science applications. To extract valuable information from string data, RAPIDS libcudf provides powerful tools for accelerating string data transformations. libcudf is a C++ GPU DataFrame library used for loading, joining, aggregating, and filtering data. In data science, string data represents speech, text, genetic sequences…

Gregory Kimball <![CDATA[Boosting Data Ingest Throughput with GPUDirect Storage and RAPIDS cuDF]]> http://www.open-lab.net/blog/?p=47682 2023-06-12T20:36:40Z 2022-05-27T21:45:32Z

If you work in data analytics, you know that data ingest is often the bottleneck of data preprocessing workflows. Getting data from storage and decoding it can often be one of the most time-consuming steps in the workflow because of the data volume and the complexity of commonly used formats. Optimizing data ingest can greatly reduce this bottleneck for data scientists working on large data sets.
