The NVIDIA Grace CPU Superchip delivers outstanding performance and best-in-class energy efficiency for CPU workloads in the data center and in the cloud. The benefits of NVIDIA Grace include high-performance Arm Neoverse V2 cores, fast NVIDIA-designed Scalable Coherency Fabric, and low-power high-bandwidth LPDDR5X memory. These features make the Grace CPU ideal for data processing with…
]]>JSON is a widely adopted format for text-based information working interoperably between systems, most commonly in web applications and large language models (LLMs). While the JSON format is human-readable, it is complex to process with data science and data engineering tools. JSON data often takes the form of newline-delimited JSON Lines (also known as NDJSON) to represent multiple records…
]]>The One Billion Row Challenge is a fun benchmark to showcase basic data processing operations. It was originally launched as a pure-Java competition, and has gathered a community of developers in other languages, including Python, Rust, Go, Swift, and more. The challenge has been useful for many software engineers with an interest in exploring the details of text file reading…
]]>Parquet writers provide encoding and compression options that are turned off by default. Enabling these options may provide better lossless compression for your data, but understanding which options to use for your specific use case is critical to making sure they perform as intended. In this post, we explore which encoding and compression options work best for your string data.
]]>Nested data types are a convenient way to represent hierarchical relationships within columnar data. They are frequently used as part of extract, transform, load (ETL) workloads in business intelligence, recommender systems, cybersecurity, geospatial, and other applications. List types can be used to easily attach multiple transactions to a user without creating a new lookup table…
]]>JSON is a widely adopted format for text-based information working interoperably between systems, most commonly in web applications. While the JSON format is human-readable, it is complex to process with data science and data engineering tools. To bridge that gap, RAPIDS cuDF provides a GPU-accelerated JSON reader (cudf.read_json) that is efficient and robust for many JSON data structures.
]]>Efficient processing of string data is vital for many data science applications. To extract valuable information from string data, RAPIDS libcudf provides powerful tools for accelerating string data transformations. libcudf is a C++ GPU DataFrame library used for loading, joining, aggregating, and filtering data. In data science, string data represents speech, text, genetic sequences…
]]>If you work in data analytics, you know that data ingest is often the bottleneck of data preprocessing workflows. Getting data from storage and decoding it can often be one of the most time-consuming steps in the workflow because of the data volume and the complexity of commonly used formats. Optimizing data ingest can greatly reduce this bottleneck for data scientists working on large data sets.
]]>