Getting Started with GPU Acceleration for Data Science

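In data science, operational efficiency is essential for handling increasingly large and complex datasets. GPU acceleration has become a key part of modern workflows and can deliver significant performance improvements.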

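RAPIDS is a suite of open-source libraries and frameworks developed by NVIDIA, designed to accelerate data science pipelines on GPUs with minimal code changes. It provides tools such as cuDF for data manipulation, cuML for machine learning, and cuGraph for graph analytics, and it integrates with existing Python libraries so that data scientists can achieve faster, more efficient processing.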

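This post shares tips for transitioning from CPU data science libraries to GPU-accelerated workflows, and is aimed especially at experienced data scientists.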

Setting up RAPIDS on desktop or cloud infrastructure

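Getting started with RAPIDS is straightforward, but it does have several dependencies. The recommended approach is to follow the official RAPIDS Installation Guide, which provides detailed instructions for local installation. There are several ways to install the framework: with pip install, with a Docker image, or through an environment manager such as Conda. To set up RAPIDS in a cloud environment, see the RAPIDS Cloud Deployment Guide. Before installing, check the supported CUDA and RAPIDS versions on the installation page to ensure compatibility.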
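Before choosing an installation path, it can help to see what is already present in the current Python environment. The following is a minimal sketch using only the standard library. The helper name rapids_environment_report is hypothetical and not part of RAPIDS, and finding nvidia-smi on the PATH is only a rough proxy for a working NVIDIA driver.

```python
import importlib.util
import shutil

# Package names mentioned in this post; adjust as needed.
RAPIDS_PACKAGES = ("cudf", "cuml", "cugraph", "cupy")


def rapids_environment_report(packages=RAPIDS_PACKAGES):
    """Return a dict describing which packages are importable (hypothetical helper)."""
    report = {name: importlib.util.find_spec(name) is not None for name in packages}
    # nvidia-smi ships with the NVIDIA driver; finding it is a hint, not proof, of a usable GPU.
    report["nvidia-smi"] = shutil.which("nvidia-smi") is not None
    return report


report = rapids_environment_report()
for name, found in report.items():
    print(f"{name:<12}{'found' if found else 'missing'}")
```

A report with missing entries does not say which installation route to take; the RAPIDS Installation Guide remains the reference for matching CUDA and RAPIDS versions.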

cuDF and GPU acceleration for pandas

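One advantage of RAPIDS is its modular architecture, which lets users adopt specific libraries designed for GPU-accelerated workflows. Among them, cuDF stands out as a powerful tool for moving from traditional pandas-based workflows to GPU-optimized data processing without code changes.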

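To get started, make sure the cuDF extension is enabled before importing pandas, so that data import and all subsequent operations run on the GPU. Loading the RAPIDS extension with %load_ext cudf.pandas lets you integrate cuDF DataFrames into existing workflows while keeping the familiar pandas syntax and structure.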

Like pandas, cuDF pandas supports a range of file formats, such as .csv, .json, .pickle, and .parquet, so data in these formats can be manipulated with GPU acceleration.

The following code shows how to enable the cudf.pandas extension and concatenate two .csv files:

%load_ext cudf.pandas

import pandas as pd 
import cupy as cp 
  
train = pd.read_csv('./Titanic/train.csv') 
test = pd.read_csv('./Titanic/test.csv') 
concat = pd.concat([train, test], axis = 0) 
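The %load_ext magic only works inside IPython or Jupyter. For plain Python scripts, cuDF offers two equivalents: running the script with python -m cudf.pandas script.py, or calling cudf.pandas.install() before pandas is first imported. The sketch below wraps the second option in a hypothetical helper, enable_cudf_pandas, that falls back to ordinary pandas when cuDF cannot be loaded. The broad exception handling is an assumption made so the snippet also runs on machines without RAPIDS or a GPU.

```python
def enable_cudf_pandas() -> bool:
    """Try to turn on the cudf.pandas accelerator; return True on success."""
    try:
        import cudf.pandas  # requires a RAPIDS installation and a supported GPU
        cudf.pandas.install()  # must run before pandas is first imported
    except Exception:  # ImportError without RAPIDS; other errors without a usable GPU
        return False
    return True


gpu_enabled = enable_cudf_pandas()
print("cudf.pandas active:", gpu_enabled)

# From here on, `import pandas as pd` gives the accelerated module when
# gpu_enabled is True, and plain CPU pandas otherwise.
```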

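With the cudf.pandas extension loaded, familiar pandas operations such as filtering, grouping, and merging run on the GPU without any changes or rewrites to your code. The cuDF accelerator is compatible with the pandas API, which makes for a smooth transition from CPU to GPU while substantially increasing computation speed. The snippets below first tile the train and test sets to one million rows each, then apply a filter, a group-by, and a merge.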

target_rows = 1_000_000
repeats = -(-target_rows // len(train))  # Ceiling division
train_df = pd.concat([train] * repeats, ignore_index=True).head(target_rows)
print(train_df.shape)  # (1000000, 12)

repeats = -(-target_rows // len(test))  # Ceiling division
test_df = pd.concat([test] * repeats, ignore_index=True).head(target_rows)
print(test_df.shape)  # (1000000, 11)
 
combine = [train_df, test_df]
 
 
(1000000, 12)
(1000000, 11)
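The -(-a // b) idiom used above is integer ceiling division: floor-dividing the negated numerator rounds toward negative infinity, and negating the result turns that into rounding up. A small standalone sketch (the helper name ceil_div and the sample sizes are illustrative, not taken from the notebook) shows why it guarantees at least target_rows rows before head() trims the excess:

```python
import math


def ceil_div(a: int, b: int) -> int:
    """Ceiling of a / b using only integer arithmetic."""
    return -(-a // b)


target_rows = 1_000
for source_rows in (7, 250, 999):  # illustrative dataset sizes
    repeats = ceil_div(target_rows, source_rows)
    # Agrees with math.ceil for these sizes, without floating-point division.
    assert repeats == math.ceil(target_rows / source_rows)
    # Enough copies to reach the target; one fewer would fall short.
    assert repeats * source_rows >= target_rows
    assert (repeats - 1) * source_rows < target_rows
    print(source_rows, repeats, repeats * source_rows)
```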

filtered_df = train_df[(train_df['Age'] > 30) & (train_df['Fare'] > 50)]
grouped_df = train_df.groupby('Embarked')[['Fare', 'Age']].mean()
additional_info = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'VIP_Status': ['No', 'Yes', 'No']
})
merged_df = train_df.merge(additional_info, on='PassengerId', how='left')

Decoding performance: CPU and GPU runtime metrics in action

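In data science, performance optimization is not only about speed; it also involves understanding how compute resources are being used. That includes analyzing how operations use the CPU and GPU architectures, identifying inefficiencies, and implementing strategies to improve workflow efficiency.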

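Profiling tools such as %%cudf.pandas.profile play a key role by examining code execution in detail. The following result breaks the run down by function and distinguishes tasks processed on the CPU from those accelerated on the GPU: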

%%cudf.pandas.profile
train_df[['Pclass', 'Survived']].groupby(['Pclass'], 
as_index=False).mean().sort_values(by='Survived', ascending=False)

        Pclass    Survived
0         1        0.629592
1         2        0.472810
2         3        0.242378
 
                         Total time elapsed: 5.131 seconds
                         5 GPU function calls in 5.020 seconds
                         0 CPU function calls in 0.000 seconds
 
                                       Stats
 
+-----------------------+------------+-------------+-------------+------------+-------------+-------------+
| Function              | GPU ncalls | GPU cumtime | GPU percall | CPU ncalls | CPU cumtime | CPU percall |
+-----------------------+------------+-------------+-------------+------------+-------------+-------------+
| DataFrame.__getitem__ | 1          | 5.000       | 5.000       | 0          | 0.000       | 0.000       |
| DataFrame.groupby     | 1          | 0.000       | 0.000       | 0          | 0.000       | 0.000       |
| GroupBy.mean          | 1          | 0.007       | 0.007       | 0          | 0.000       | 0.000       |
| DataFrame.sort_values | 1          | 0.002       | 0.002       | 0          | 0.000       | 0.000       |
| DataFrame.__repr__    | 1          | 0.011       | 0.011       | 0          | 0.000       | 0.000       |
+-----------------------+------------+-------------+-------------+------------+-------------+-------------+

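This granularity helps pinpoint operations that inadvertently fall back to CPU execution, which commonly happens because of unsupported cuDF functions, incompatible data types, or suboptimal memory handling. Identifying these issues is crucial because such fallbacks can severely affect overall performance. To learn more about this profiler, see Mastering cudf.pandas Profiler for GPU Acceleration.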
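One way to read the table is to convert each cumulative time into a share of the total. The sketch below copies the GPU cumtime values from the output above into a plain dictionary (the variable names are illustrative). It shows that nearly all of the 5.020 seconds is attributed to the initial column selection, while the group-by, mean, and sort together account for about 9 milliseconds.

```python
# GPU cumulative times in seconds, copied from the profiler output above.
gpu_cumtime = {
    "DataFrame.__getitem__": 5.000,
    "DataFrame.groupby": 0.000,
    "GroupBy.mean": 0.007,
    "DataFrame.sort_values": 0.002,
    "DataFrame.__repr__": 0.011,
}

total = sum(gpu_cumtime.values())
shares = {name: seconds / total for name, seconds in gpu_cumtime.items()}

# Print the functions from most to least expensive.
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<24}{gpu_cumtime[name]:>7.3f} s {share:>7.1%}")
```

The same calculation applied to the CPU columns would surface any fallback: a nonzero CPU share marks an operation worth investigating.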

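In addition, you can use IPython magic commands such as %%time and %%timeit to benchmark specific blocks of code and directly compare runtimes between pandas (CPU) and the cuDF accelerator (GPU). Benchmarking with %%time gives a clear comparison of execution times in the two environments and highlights the efficiency gained through parallel processing.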

%%time 
  
print("Before", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape) 
  
train_df = train_df.drop(['Ticket', 'Cabin'], axis=1) 
test_df = test_df.drop(['Ticket', 'Cabin'], axis=1) 
combine = [train_df, test_df] 
  
print("After", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape) 

CPU output:
Before (999702, 12) (999856, 11) (999702, 12) (999856, 11)
After  (999702, 10) (999856, 9)  (999702, 10) (999856, 9)
 
CPU times: user 56.6 ms, sys: 8.08 ms, total: 64.7 ms
 
Wall time: 63.3 ms

GPU output:
Before (999702, 12) (999856, 11) (999702, 12) (999856, 11)
After  (999702, 10) (999856, 9)  (999702, 10) (999856, 9)
 
CPU times: user 6.65 ms, sys: 0 ns, total: 6.65 ms
 
Wall time: 5.46 ms

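In the %%time example, GPU acceleration reduces wall time from 63.3 milliseconds (ms) on the CPU to 5.46 ms on the GPU, a speedup of more than 10x (roughly 11.6x). This highlights the efficiency of GPU acceleration with cuDF pandas for large-scale data operations. For further insight, use %%timeit, which runs the code repeatedly to measure how consistent and reliable the performance figures are.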

%%timeit  
  
for dataset in combine: 
    dataset['Title'] = dataset.Name.str.extract(' ([A-Za-z]+)\\.', expand=False) 
  
pd.crosstab(train_df['Title'], train_df['Sex']) 

CPU output:
1.11 s ± 7.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

GPU output:
89.6 ms ± 959 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

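With GPU acceleration, the %%timeit example also runs more than 10x faster (roughly 12x), dropping from 1.11 seconds per loop on the CPU to 89.6 ms per loop on the GPU. This highlights the efficiency of cuDF pandas for intensive data operations.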
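Outside a notebook, the standard-library timeit module can stand in for %%timeit, and a speedup is simply the ratio of the two measurements. The following sketch is illustrative only: speedup and best_time are hypothetical helpers, and the timed workload is a plain Python stand-in, not the Titanic code above. The last lines plug in the timings reported in this post.

```python
import timeit


def speedup(baseline: float, accelerated: float) -> float:
    """How many times faster the accelerated run is (same units for both)."""
    return baseline / accelerated


def best_time(func, repeat: int = 5, number: int = 10) -> float:
    """Best per-call time in seconds over several repeats, similar in spirit to %%timeit."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


# Stand-in workload so the sketch runs anywhere; replace with your own callable.
elapsed = best_time(lambda: sum(i * i for i in range(10_000)))
print(f"stand-in workload: {elapsed * 1e3:.3f} ms per call")

# Figures reported above, both converted to milliseconds.
time_speedup = speedup(63.3, 5.46)      # %%time wall times
timeit_speedup = speedup(1110.0, 89.6)  # %%timeit means (1.11 s = 1110 ms)
print(f"%%time:   {time_speedup:.1f}x")
print(f"%%timeit: {timeit_speedup:.1f}x")
```

Note that best_time reports the minimum over repeats, whereas %%timeit prints a mean and standard deviation, so the two are comparable in purpose rather than identical in method.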

Verifying GPU utilization

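When working with different data types, it is important to verify that your system is actually using the GPU. You can use the familiar type command to distinguish NumPy arrays from CuPy arrays and so check whether an array is being processed on the CPU or the GPU.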

type(guess_ages)

cupy.ndarray

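If the output is numpy.ndarray, the data is being processed on the CPU. If the output is cupy.ndarray, the data is being processed on the GPU. This quick check confirms that your workflow is using GPU resources as intended.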
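The same check can be made programmatic by looking at the top-level package that defines an object's type. This is a minimal sketch: array_backend is a hypothetical helper, and the two stand-in classes below merely imitate objects whose type lives in the numpy or cupy package so that the snippet runs without either library installed.

```python
def array_backend(obj) -> str:
    """Classify an object by the top-level package that defines its type."""
    root = type(obj).__module__.split(".")[0]
    if root == "cupy":
        return "GPU (CuPy)"
    if root == "numpy":
        return "CPU (NumPy)"
    return f"unknown ({root})"


# Stand-ins for demonstration only; in real code pass actual arrays such as guess_ages.
FakeCupyArray = type("ndarray", (), {"__module__": "cupy"})
FakeNumpyArray = type("ndarray", (), {"__module__": "numpy"})

print(array_backend(FakeCupyArray()))
print(array_backend(FakeNumpyArray()))
print(array_backend([1, 2, 3]))
```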

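Second, a simple print command confirms whether the GPU is in use and that cuDF DataFrames are being processed. The output indicates which library provides the fast path (cuDF) and which provides the slow path (pandas). This simple check is an easy way to verify that the GPU is active and accelerating data operations.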

print(pd)

<module 'pandas' (ModuleAccelerator(fast=cudf, slow=pandas))>
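In a script, the same information can drive a guard that warns when acceleration is not active. The sketch below relies only on the representation shown above, where the accelerated module's repr contains ModuleAccelerator. Treating that substring as the signal is an assumption based on this output rather than a documented API, and is_accelerated is a hypothetical helper. The json module is used as a stand-in so the snippet runs without pandas installed.

```python
import json  # stand-in for an ordinary, non-accelerated module


def is_accelerated(module) -> bool:
    """Heuristic: cudf.pandas shows 'ModuleAccelerator' in the module's repr."""
    return "ModuleAccelerator" in repr(module)


# With cudf.pandas loaded, is_accelerated(pd) is expected to be True
# according to the output above; an ordinary module yields False.
print(is_accelerated(json))
```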

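Finally, you can use commands such as df.info to inspect the structure of a cuDF DataFrame and confirm that computations are GPU-accelerated. This helps verify whether operations are running on the GPU or falling back to the CPU.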

train_df.info()

<class 'cudf.core.dataframe.DataFrame'>
 
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 9 columns):
 #   Column   Non-Null Count   Dtype   
---  ------   --------------   -----  
 0   Survived 1000000 non-null  int64   
 1   Pclass   1000000 non-null  int64   
 2   Sex      1000000 non-null  int64   
 3   Age      1000000 non-null  float64 
 4   SibSp    1000000 non-null  int64   
 5   Parch    1000000 non-null  int64   
 6   Fare     1000000 non-null  float64 
 7   Embarked 997755 non-null   object 
 8   Title    1000000 non-null  int64   
dtypes: float64(2), int64(6), object(1)
memory usage: 65.9+ MB

Conclusion

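With tools such as cuDF pandas, RAPIDS enables a smooth transition from traditional CPU-based data workflows to GPU-accelerated processing, with significant performance gains. Using %%time and %%timeit together with profiling tools such as %%cudf.pandas.profile, you can measure and optimize runtime efficiency. Checking GPU utilization with simple commands such as type, print(pd), and df.info confirms that your workflow is making effective use of GPU resources.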

To try the data operations detailed in this post, check out the accompanying Jupyter Notebook.

To learn more about GPU-accelerated data science, see "10 Minutes to Data Science: Transitioning Between RAPIDS cuDF and CuPy Libraries" and "RAPIDS cuDF Instantly Accelerates pandas Up to 50x on Google Colab".

Join us at GTC 2025 and register for the Data Science Track to gain deeper insights. Recommended sessions include: