In traditional clinical practice, treatment decisions are often based on general guidelines, past experience, and trial-and-error approaches. Today, with access to electronic medical records (EMRs) and genomic data, a new era of precision medicine is emerging: one where treatments are tailored to individual patients with unprecedented accuracy. Precision medicine is an innovative approach…
The world of big data analytics is constantly seeking ways to accelerate processing and reduce infrastructure costs. Apache Spark has become a leading platform for scale-out analytics, handling massive datasets for ETL, machine learning, and deep learning workloads. While Spark has traditionally been CPU-based, GPU acceleration offers a compelling promise: significant speedups for data processing…
Apache Spark is an industry-leading platform for big data processing and analytics. With the increasing prevalence of unstructured data such as documents, emails, and multimedia content, deep learning (DL) and large language models (LLMs) have become core components of the modern data analytics pipeline. These models enable a variety of downstream tasks, such as image captioning, semantic tagging…
As data sizes have grown in enterprises across industries, Apache Parquet has become a prominent format for storing data. Apache Parquet is a columnar storage format designed for efficient data processing at scale. By organizing data by columns rather than rows, Parquet enables high-performance querying and analysis, as it can read only the necessary columns for a query instead of scanning entire…
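As a minimal PySpark sketch of that column-pruning behavior (the file path and column names here are hypothetical), only the selected columns are read from the Parquet files rather than whole rows:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-column-pruning").getOrCreate()

# Hypothetical dataset: because Parquet stores each column contiguously,
# only the two selected columns are read from disk for this query.
sales = spark.read.parquet("/data/sales.parquet")
revenue_by_region = (
    sales.select("region", "revenue")   # column pruning: untouched columns are skipped
         .groupBy("region")
         .sum("revenue")
)
revenue_by_region.show()
```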
The NVIDIA Grace CPU Superchip delivers outstanding performance and best-in-class energy efficiency for CPU workloads in the data center and in the cloud. The benefits of NVIDIA Grace include high-performance Arm Neoverse V2 cores, the fast NVIDIA-designed Scalable Coherency Fabric, and low-power, high-bandwidth LPDDR5X memory. These features make the Grace CPU ideal for data processing with…
The NVIDIA RAPIDS Accelerator for Apache Spark software plug-in pioneered a zero code change user experience (UX) for GPU-accelerated data processing. It accelerates existing Apache Spark SQL and DataFrame-based applications on NVIDIA GPUs by over 9x without requiring a change to your queries or source code. This led to the new Spark RAPIDS ML Python library, which can speed up…
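A rough sketch of what "zero code change" means in practice: the plug-in is enabled through Spark configuration rather than code edits. The exact jar placement and resource settings depend on your cluster, so treat the values below as illustrative only:

```python
from pyspark.sql import SparkSession

# Illustrative configuration only: the RAPIDS Accelerator jar must already be
# on the classpath, and resource settings vary by cluster.
spark = (
    SparkSession.builder
    .appName("rapids-accelerated-etl")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")  # enable the plug-in
    .config("spark.rapids.sql.enabled", "true")
    .getOrCreate()
)

# Existing SQL/DataFrame code runs unchanged; supported operators execute on the GPU.
spark.read.parquet("/data/events.parquet").groupBy("event_type").count().show()
```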
JSON is a popular format for text-based data that allows for interoperability between systems in web applications as well as data management. The format has existed since the early 2000s and grew out of the need for communication between web servers and browsers. The standard JSON format consists of key-value pairs that can include nested objects. JSON has grown in usage for storing web…
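For illustration, here is a small sketch of that key-value structure with a nested object, loaded into a Spark DataFrame (the records and field names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-example").getOrCreate()

# Each record is a set of key-value pairs; "address" is a nested object.
records = [
    '{"id": 1, "name": "Ada", "address": {"city": "London", "zip": "NW1"}}',
    '{"id": 2, "name": "Lin", "address": {"city": "Taipei", "zip": "100"}}',
]
df = spark.read.json(spark.sparkContext.parallelize(records))

# Nested fields can be addressed with dot notation.
df.select("id", "name", "address.city").show()
```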
With the rapid growth of generative AI, CIOs and IT leaders are looking for ways to reclaim data center resources to accommodate new AI use cases that promise greater return on investment without impacting current operations. This is leading IT decision makers to reassess past infrastructure decisions and explore strategies to consolidate traditional workloads into fewer…
With AI introducing an unprecedented pace of technological innovation, staying ahead means keeping your skills up to date. The NVIDIA Developer Program gives you the tools, training, and resources you need to succeed with the latest advancements across industries. We're excited to announce the following five new technical courses from NVIDIA. Join the Developer Program now to get hands-on…
As the scale of available data continues to grow, so does the need for scalable and intelligent data processing systems that can swiftly extract useful knowledge. Especially in high-stakes domains such as life sciences and finance, transparency of data-driven processes becomes paramount alongside scalability to ensure trustworthiness. Started by scientists coming from the Knowledge…
Spark RAPIDS ML is an open-source Python package enabling NVIDIA GPU acceleration of PySpark MLlib. It offers PySpark MLlib DataFrame API compatibility and speedups when training with the supported algorithms. See New GPU Library Lowers Compute Costs for Apache Spark ML for more details. PySpark MLlib DataFrame API compatibility means easier incorporation into existing PySpark ML applications…
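A hedged sketch of what that compatibility typically looks like: swap the import and keep the rest of the PySpark ML code the same. The dataset, column names, and parameters below are illustrative, and details may vary by algorithm and library version:

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
# GPU-accelerated drop-in for pyspark.ml.clustering.KMeans (assumed import path):
from spark_rapids_ml.clustering import KMeans

spark = SparkSession.builder.appName("spark-rapids-ml-demo").getOrCreate()

# Tiny illustrative dataset; real workloads would use large DataFrames.
train_df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),),
     (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),)],
    ["features"],
)

kmeans = KMeans(k=2, featuresCol="features", predictionCol="cluster")
model = kmeans.fit(train_df)          # same estimator/model API as PySpark MLlib
model.transform(train_df).show()
```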
Dive into the RAPIDS Accelerator for Apache Spark toolset, including the workload qualification tool for estimating speedup on GPU and the profiling tool for tuning jobs.
Streamline and accelerate deployment by integrating ETL and ML training into a single Apache Spark script on Amazon EMR.
Running extract-transform-load (ETL) operations on large-scale data with GPUs using the NVIDIA RAPIDS Accelerator for Apache Spark can produce both cost savings and performance gains. We demonstrated this in our previous post, GPUs for ETL? Run Faster, Less Costly Workloads with NVIDIA RAPIDS Accelerator for Apache Spark and Databricks. In this post, we dive deeper to identify precisely which…
We were stuck. Really stuck. With a hard delivery deadline looming, our team needed to figure out how to process a complex extract-transform-load (ETL) job on trillions of point-of-sale transaction records in a few hours. The results of this job would feed a series of downstream machine learning (ML) models that would make critical retail assortment allocation decisions for a global retailer.
Apache Spark is an industry-leading platform for distributed extract, transform, and load (ETL) workloads on large-scale data. However, with the advent of deep learning (DL), many Spark practitioners have sought to add DL models to their data processing pipelines across a variety of use cases like sales predictions, content recommendations, sentiment analysis, and fraud detection. Yet…
When you see a context-relevant advertisement on a web page, it's most likely content served by a Taboola data pipeline. For Taboola, the leading content recommendation company in the world, a big challenge was the frequent need to scale Apache Spark CPU cluster capacity to address constantly growing compute and storage requirements. Data center capacity and hardware costs are always…
Spark MLlib is a key component of Apache Spark for large-scale machine learning and provides built-in implementations of many popular machine learning algorithms. These implementations were created a decade ago, but do not leverage modern computing accelerators, such as NVIDIA GPUs. To address this gap, we have recently open-sourced Spark RAPIDS ML (NVIDIA/spark-rapids-ml)…
The Dataiku platform for everyday AI simplifies deep learning. Use cases are far-reaching, from image classification to object detection and natural language processing (NLP). Dataiku helps you with labeling, model training, explainability, model deployment, and centralized management of code and code environments. This post dives into high-level Dataiku and NVIDIA integrations for image…
Generative AI has marked an important milestone in the AI revolution journey. We are at a fundamental breaking point where enterprises are not only getting their feet wet but jumping into the deep end. With over 50 frameworks, pretrained models, and development tools, NVIDIA AI Enterprise, the software layer of the NVIDIA AI platform, is designed to accelerate enterprises to the leading edge…
A retailer's supply chain includes sourcing raw materials or finished goods from suppliers; storing them in warehouses or distribution centers; transporting them to stores or customers; and managing sales. Retailers also collect, store, and analyze data to optimize supply chain performance, and they have teams responsible for managing each stage of the supply chain…
Learn about the latest AI and data science breakthroughs from leading data science teams at NVIDIA GTC 2023.
According to IDC, the volume of data generated each year is growing exponentially. IDC's Global DataSphere projects that the world will generate 221 ZB of data by 2026. This data holds fantastic information. But as the volume of data grows, so does the processing cost. As a data scientist or engineer, you've certainly felt the pain of slow-running data-processing jobs.
NVIDIA revealed major updates to its suite of AI software for developers, including JAX, NVIDIA CV-CUDA, and NVIDIA RAPIDS. To learn about the latest SDK advancements from NVIDIA, watch the keynote from CEO Jensen Huang. Just today at GTC 2022, NVIDIA introduced JAX on NVIDIA AI, the newest addition to its GPU-accelerated deep learning frameworks. JAX is a rapidly growing…
Learn about the latest AI and data science breakthroughs from the world's leading data science teams at GTC 2022.
RAPIDS Accelerator for Apache Spark v21.10 is now available! As an open source project, we value our community, their voice, and their requests. This release delivers on community requests for operations that are ideally suited for GPU acceleration. Important callouts for this release: RAPIDS Accelerator for Apache Spark is growing at a great pace in both functionality and…
NVIDIA GTC is the must-attend AI conference for developers. It's a place where practitioners, leaders, and innovators share their ideas about the latest trends in data science. Here are six top data science GTC sessions worth attending. Thursday, Nov 11, 5:00 AM – 5:25 AM PST: Domino's Pizza delivers thousands of pizzas a day and requires real-time planning and logistics capabilities.
The August release (21.08) of RAPIDS Accelerator for Apache Spark is now available. It has been a little over a year since the first release at NVIDIA GTC 2020. We have improved in so many ways, particularly in terms of ease of use, with minimal to no code changes for Apache Spark applications. This last year, the team has been focused on adding functionality and continuously improving…
Azure recently announced support for NVIDIA's T4 Tensor Core Graphics Processing Units (GPUs), which are optimized for deploying machine learning inferencing or analytical workloads in a cost-effective manner. With Apache Spark deployments tuned for NVIDIA GPUs, plus pre-installed libraries, Azure Synapse Analytics offers a simple way to leverage GPUs to power a variety of data processing and…
RAPIDS Accelerator for Apache Spark v21.06 is here! You may notice right away that we've had a huge leap in version number since we announced our last release. Don't worry, you haven't missed anything. RAPIDS Accelerator is built on cuDF, part of the RAPIDS ecosystem. RAPIDS transitioned to calendar versioning (CalVer) in the last release, and, from now on, our releases will follow the same…
Editor's Note: Get notified and be the first to download our real-world blueprint once it's available. This is the third installment in a series describing an end-to-end blueprint for predicting customer churn. In previous installments, we've discussed some of the challenges of machine learning systems that don't appear until you get to production: In the first installment…
Recommender systems drive engagement on many of the most popular online platforms. As data volume grows exponentially, data scientists increasingly turn from traditional machine learning methods to highly expressive deep learning models to improve recommendation quality. Often, the recommendations are framed as modeling the completion of a user-item matrix, in which the user-item entry is the…
Editor's Note: Get notified and be the first to download our real-world blueprint once it's available. If you want to solve a particular kind of business problem with machine learning, you'll likely have no trouble finding a tutorial showing you how to extract features and train a model. However, building machine learning systems isn't just about training models or even about finding the best…
With the growing interest in deep learning (DL), more and more users are using DL in production environments. Because DL requires intensive computational power, developers are leveraging GPUs to do their training and inference jobs. Recently, as part of a major Apache Spark initiative to better unify DL and data processing on Spark, GPUs became a schedulable resource in Apache Spark 3.
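As a hedged sketch of how that scheduling is typically requested, GPU resources are declared through Spark configuration; the amounts and the discovery script path below are illustrative and depend on your cluster:

```python
from pyspark import TaskContext
from pyspark.sql import SparkSession

# Illustrative resource settings; the discovery script path is hypothetical.
spark = (
    SparkSession.builder
    .appName("gpu-scheduling-demo")
    .config("spark.executor.resource.gpu.amount", "1")   # GPUs requested per executor
    .config("spark.task.resource.gpu.amount", "1")        # GPUs required per task
    .config("spark.executor.resource.gpu.discoveryScript",
            "/opt/spark/scripts/getGpusResources.sh")      # script that reports GPU addresses
    .getOrCreate()
)

def gpu_addresses(rows):
    # Inside a task, the assigned GPU addresses are available via the TaskContext.
    addrs = TaskContext.get().resources()["gpu"].addresses
    yield from ((row, addrs) for row in rows)

# Example usage on an existing DataFrame's underlying RDD (cluster must expose GPUs):
# spark.range(4).rdd.mapPartitions(gpu_addresses).collect()
```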
Apache Spark has emerged as the standard framework for large-scale, distributed data analytics processing. NVIDIA worked with the Apache Spark community to accelerate the world's most popular data analytics framework and to offer revolutionary GPU acceleration on several leading platforms, including Google Cloud, Databricks, and Cloudera. Now, Amazon EMR joins the list of leading platforms…
Apache Spark provides capabilities to program entire clusters with implicit data parallelism. With Spark 3.0 and the open source RAPIDS Accelerator for Spark, these capabilities are extended to GPUs. However, prior to this work, all CUDA operations happened in the default stream, causing implicit synchronization and failing to take advantage of concurrency on the GPU. In this post, we look at how to use…
Machine learning (ML) data is big and messy. Organizations have increasingly adopted RAPIDS and cuML to help their teams run experiments faster and achieve better model performance on larger datasets. That, in turn, accelerates the training of ML models using GPUs. With RAPIDS, data scientists can now train models 100X faster and more frequently. Like RAPIDS, we've ensured that our data logging…
At GTC Spring 2020, Adobe, Verizon Media, and Uber each discussed how they used Spark 3.0 with GPUs to accelerate and scale ML big data pre-processing, training, and tuning pipelines. There are multiple challenges when it comes to the performance of large-scale machine learning (ML) solutions: huge datasets, complex data preprocessing and feature engineering pipelines…
Apache Spark continued the effort to analyze big data that Apache Hadoop started over 15 years ago and has become the leading framework for large-scale distributed data processing. Today, hundreds of thousands of data engineers and scientists are working with Spark across 16,000+ enterprises and organizations. One reason why Spark has taken the torch from Hadoop is because it can process data…
Given the parallel nature of many data processing tasks, it's only natural that the massively parallel architecture of a GPU should be able to parallelize and accelerate Apache Spark data processing queries, in the same way that a GPU accelerates deep learning (DL) in artificial intelligence (AI). NVIDIA has worked with the Apache Spark community to implement GPU acceleration through the…
Did you see the White House's recent initiative on Precision Medicine and how it is transforming the ways we can treat cancer? Have you avoided clicking on a malicious website based on OpenDNS's SecureRank predictive analytics? Are you using the Wikidata Query Service to gather data to use in your machine learning or deep learning application? If so, you have seen the power of graph applications.