Processing High-Quality Vietnamese Language Data with NVIDIA NeMo Curator

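Open-source large language models (LLMs) excel in English but struggle with other languages, especially the languages of Southeast Asia. This is primarily due to a lack of training data in these languages, limited understanding of local cultures, and insufficient tokens to capture unique linguistic structures and expressions.

To fully meet customer needs, enterprises in non-English-speaking countries must go beyond generic models and customize them to capture the nuances of their local languages, ensuring a seamless and impactful customer experience.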

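In this blog post, we explore how Viettel Solutions, a fast-growing subsidiary of Viettel Corporation, has leveraged NVIDIA NeMo Curator to process high-quality Vietnamese data for training Llama 3 ViettelSolution 8B, a state-of-the-art LLM that now ranks among the top of the VMLU leaderboard. NeMo Curator is a GPU-accelerated data curation tool that enables large-scale, high-quality datasets for pretraining LLMs.

A critical first step in this process is curating a large-scale, high-quality dataset. This post walks through the data curation pipeline used, with sample code for each stage and detailed exploratory data analysis (EDA) to illustrate the impact of each step. By the end of the post, you will have a clear roadmap and reference for getting started with NeMo Curator, whether for Vietnamese or for other languages.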

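Viettel Solutions is a pioneer in providing digital transformation solutions to the Vietnamese government and enterprises, focused on meeting the growing demand for AI adoption across industries. With a vision of leading in generative AI and developing AI-empowered products for its customers, Viettel collaborated with the NVIDIA NeMo Curator team.

"NeMo Curator's GPU-accelerated features, including exact and fuzzy deduplication and heuristic and classifier filtering, improved accuracy by 10%, cut training time by a factor of three, and reduced the dataset size by 60%," said Tuan Nguyen, head of data analytics at Viettel Solutions.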

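Prerequisites and environment setup

To follow the steps in this post, make sure you have the following setup:

CUDA and NVIDIA drivers: CUDA 12.3 with driver 545.23.08
Ubuntu 22.04
NVIDIA Container Toolkit version 1.15.0

Installation

First, install NeMo Curator by following the instructions in the README of the NeMo Curator repository for installing the CPU and CUDA-accelerated modules.

Next, install the datasets and jsonlines packages, which are used later.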

pip install datasets
pip install jsonlines

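To proceed with data processing, set up a Dask environment. Dask is a flexible, open-source library that enables parallel and distributed computing in Python, allowing you to scale computations across multiple cores or even clusters. By distributing tasks, Dask significantly increases the speed and efficiency of data processing.

We ran this experiment on an NVIDIA DGX A100 with a 128-core CPU and 2 TB of RAM to handle the dataset size. Depending on your dataset and computing resources, you may need to adjust the Dask worker configuration accordingly. You can start a Dask cluster with the following code: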

import nemo_curator
from dask.distributed import Client, LocalCluster
# Start a Dask cluster with 12 workers, each limited at 64GB of memory.  You might need to adjust these numbers according to your computing resources
cluster = LocalCluster(n_workers=12, processes=True, memory_limit= '64GB')
client = Client(cluster)
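When adapting the worker configuration to a different machine, a small helper can make the trade-off explicit. The following is a minimal sketch, not the method used to choose the 12 workers above: `suggest_local_cluster`, its `reserve_fraction` parameter, and the default values are hypothetical rules of thumb that cap the worker count by both available memory and CPU cores.

```python
def suggest_local_cluster(total_memory_gb, cpu_cores,
                          memory_per_worker_gb=64, reserve_fraction=0.25):
    """Suggest LocalCluster keyword arguments for a given machine.

    Hypothetical rule of thumb: keep `reserve_fraction` of RAM free for the
    OS and the scheduler, then fit as many workers as memory and cores allow.
    """
    usable_gb = total_memory_gb * (1 - reserve_fraction)
    workers_by_memory = int(usable_gb // memory_per_worker_gb)
    n_workers = max(1, min(workers_by_memory, cpu_cores))
    # On small machines, never promise a worker more memory than is usable.
    limit_gb = min(memory_per_worker_gb, usable_gb)
    return {
        "n_workers": n_workers,
        "processes": True,
        "memory_limit": f"{limit_gb:g}GB",
    }


# Example: a workstation with 256 GB of RAM and 32 cores.
kwargs = suggest_local_cluster(total_memory_gb=256, cpu_cores=32)
print(kwargs)
```

The returned dictionary can be unpacked into `LocalCluster(**kwargs)`. Treat the result as a starting point and watch the Dask dashboard for memory pressure before settling on final values.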

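Data processing pipeline overview

The data curation pipeline includes the following key steps:

Download and sharding: Datasets are downloaded from various sources, then combined and sharded for efficient distributed processing.
Unicode reformatting: Text is standardized into a consistent Unicode format.
Exact deduplication: Exact duplicates are removed to reduce redundancy.
Quality filtering:
- Heuristic filtering: Rule-based filters are applied to remove low-quality content.
- Classifier-based filtering: A machine learning model classifies and filters documents by quality.

*Figure 1. Data processing pipeline built with NeMo Curator.* The pipeline includes these key steps: download and sharding, Unicode reformatting, exact deduplication, and quality filtering.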

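Data collection

We drew content from multiple datasets to enrich the diversity and volume of training data for the LLM. These datasets include:

The Vietnamese subset of the C4 dataset, a large and diverse collection of web-crawled text data.
The Vietnamese subset of the OSCAR dataset, version 23.01, an aggregation of web-crawled data.
Vietnamese Wikipedia articles, which provide well-structured and informative content.
The Vietnamese news corpus, which provides locally relevant news articles.

Each dataset is accessed and downloaded through the Hugging Face Hub. OSCAR is access-restricted and requires additional steps: accept the conditions on the dataset page, then download using a Hugging Face access token.

Downloading the datasets and converting them to Parquet

Parquet is optimized for distributed systems like Dask, supporting easy partitioning and parallel processing, which improves performance when handling large-scale data. For the purposes of this post, every stage of the dataset is saved in Parquet format.

The following code snippet downloads the datasets from Hugging Face and saves them as Parquet files.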

import os
from datasets import load_dataset as load_hf_dataset
from datasets import DownloadConfig 
 
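data_dir = "./datasets/"
# Make sure the output directory exists before writing Parquet files
os.makedirs(data_dir, exist_ok=True)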
download_config = DownloadConfig(num_proc=4)
 
# Load and save Vietnamese Wikipedia dataset
ds = load_hf_dataset("wikimedia/wikipedia", "20231101.vi")
ds["train"].to_parquet(os.path.join(data_dir, "wiki_vi_231101.parquet"))
 
# Load and save Vietnamese news corpus
ds = load_hf_dataset("jetaudio/binhvq_news")
ds["train"].to_parquet(os.path.join(data_dir, "binhvq_news_train.parquet"))
 
# Load and save OSCAR dataset
ds = load_hf_dataset("oscar-corpus/OSCAR-2301", language="vi", token=True, download_config=download_config, trust_remote_code=True)
ds['train'].to_parquet(os.path.join(data_dir, 'oscar_vi.parquet'))
 
# Load and save C4 dataset
ds = load_hf_dataset("allenai/c4", data_files='multilingual/c4-vi.*.json.gz', download_config=download_config, trust_remote_code=True)
ds['train'].to_parquet(os.path.join(data_dir, "c4_vi.parquet"))

*Figure 2. Proportion of the raw dataset by source.* About 67% of the data comes from C4, 17% from the news corpus, 15% from the OSCAR dataset, and the remaining 1% from Wikipedia articles.

We used the NeMo Curator domain classifier model to classify documents into one of 26 supported domains. As shown in Figure 3, the distribution is relatively even, with many domains each accounting for between 3% and 6% of the total data. This indicates that the dataset is highly diverse and covers a wide range of topics, which is beneficial for pretraining a general-purpose language model.

*Figure 3. Domain proportions in the raw dataset, as identified by the domain classifier model.* The largest domain is Business and Industrial at 7.1%, followed closely by Arts and Entertainment at 6.8% and News at 6.4%. Other significant categories include Health at 6.2% and Sensitive Subjects at 5.7%. Smaller domains include Shopping (2.3%) and Games (2.5%), highlighting the diverse content within the dataset.

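Combining and standardizing the format

After downloading the datasets, the next step is to standardize and format the data from all sources. The data is combined into a single dataset that keeps only the "text" field, because all the text used to train the model is in that field. Non-text data and other metadata generally do not help with this task.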

from datasets import concatenate_datasets
# Combine datasets and standardize format
datasets = [os.path.join(data_dir, file) for file in ["wiki_vi_231101.parquet", "c4_vi.parquet", "oscar_vi.parquet", "binhvq_news_train.parquet"]]
 
data_files = {"train": datasets[0]}
ds = load_hf_dataset("parquet", data_files=data_files)
ds = ds["train"].remove_columns([col for col in ds["train"].column_names if col != "text"])
 
for d in datasets[1:]:
    ds_ = load_hf_dataset("parquet", data_files={"train": d})
    ds_ = ds_["train"].remove_columns([col for col in ds_["train"].column_names if col != "text"])
    ds = concatenate_datasets([ds, ds_])

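Sharding the combined dataset

The combined dataset is then split into smaller chunks. Sharding distributes the data evenly across the workers in the Dask cluster, enabling efficient parallel processing during the data curation stages.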

# Define paths for raw data
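raw_data_directory = os.path.join(data_dir, "raw")
# Make sure the shard directory exists before writing
os.makedirs(raw_data_directory, exist_ok=True)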
 
# Shard the dataset
num_shards = 256
for shard_idx in range(num_shards):
    shard = ds.shard(index=shard_idx, num_shards=num_shards)
    shard.to_parquet(os.path.join(raw_data_directory, f"{shard_idx}.parquet"))
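To see how rows end up in each shard, the sketch below reproduces the two common index-assignment schemes in plain Python: contiguous blocks and strided (round-robin) selection. It is an illustration only. `shard_indices` is a hypothetical helper, and which scheme `Dataset.shard` applies by default depends on the installed `datasets` version, so check the documentation for the `contiguous` parameter.

```python
def shard_indices(num_rows, num_shards, index, contiguous=True):
    """Return the row indices that belong to shard `index`.

    contiguous=True  -> consecutive blocks whose sizes differ by at most one row
    contiguous=False -> every `num_shards`-th row, starting at `index`
    """
    if not 0 <= index < num_shards:
        raise ValueError("index must satisfy 0 <= index < num_shards")
    if contiguous:
        div, mod = divmod(num_rows, num_shards)
        # The first `mod` shards each take one extra row.
        start = div * index + min(index, mod)
        end = start + div + (1 if index < mod else 0)
        return list(range(start, end))
    return list(range(index, num_rows, num_shards))


# Split 10 rows into 4 shards with both schemes.
for i in range(4):
    print(i, shard_indices(10, 4, i), shard_indices(10, 4, i, contiguous=False))
```

Either scheme covers every row exactly once, and shard sizes differ by at most one row, which is what keeps the Dask workers evenly loaded.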

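High-quality data processing with NeMo Curator

This section describes the NeMo Curator techniques we used. Unicode reformatting, exact deduplication, heuristic filtering, and classifier-based filtering are applied to process this dataset and refine it into a high-quality final version.

Unicode reformatting

Unicode reformatting is an essential preprocessing step that ensures text data is standardized and free of encoding errors, which are common in web-crawled datasets. The following code demonstrates how to perform Unicode reformatting with NeMo Curator: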

from nemo_curator import Modify
from nemo_curator.modifiers import UnicodeReformatter
from nemo_curator.utils.distributed_utils import read_data, write_to_disk
from nemo_curator.utils.file_utils import get_all_files_paths_under
from nemo_curator.datasets import DocumentDataset
 
# Define paths for Unicode formatted data
unicode_formatted_output_path = os.path.join(data_dir, "formatted")
 
def load_dataset(input_data_dir, file_type="parquet"):
    files = list(get_all_files_paths_under(input_data_dir))
    raw_data = read_data(files, file_type=file_type, backend="pandas", add_filename=True)
    dataset = DocumentDataset(raw_data)
 
    return dataset
 
# Load the raw data
raw_data = load_dataset(raw_data_directory, file_type="parquet")
 
# Initialize the Unicode reformatter
cleaner = Modify(UnicodeReformatter())
 
# Apply Unicode reformatting
cleaned_data = cleaner(raw_data)
 
# Save the cleaned data to disk
write_to_disk(cleaned_data.df, unicode_formatted_output_path, write_to_filename=True, output_type="parquet")
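Vietnamese text makes the need for a consistent Unicode form easy to see, because the same accented letter can be stored either as one precomposed code point or as a base letter plus combining marks. The sketch below uses only the standard library's `unicodedata` module to illustrate that kind of inconsistency. It does not reproduce what `UnicodeReformatter` does internally, and `normalize_text` is a hypothetical helper.

```python
import unicodedata


def normalize_text(text, form="NFC"):
    """Return `text` in a single, consistent Unicode normalization form."""
    return unicodedata.normalize(form, text)


# "Tiếng Việt" written with precomposed characters (escaped for clarity).
composed = "Ti\u1ebfng Vi\u1ec7t"
# The same visible text, with accents stored as separate combining marks.
decomposed = unicodedata.normalize("NFD", composed)

print(composed == decomposed)                  # the raw strings differ
print(len(composed), len(decomposed))          # different code point counts
print(normalize_text(decomposed) == composed)  # equal after normalization
```

Without a step like this, two visually identical documents can hash differently, which would let duplicates slip past exact deduplication later in the pipeline.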

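Adding custom IDs to documents

Before continuing with further curation steps, we recommend preprocessing the dataset by adding a unique ID to each document. These IDs act as trackers that help identify duplicate or low-quality documents throughout curation, ensuring that each document remains uniquely identifiable across the whole process.

NeMo Curator provides an AddId class that lets you insert custom IDs into documents using a specified prefix format (for example, <prefix>_<id>). The following code snippet demonstrates this step: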

from nemo_curator import AddId
 
# Define paths for input data and output with added IDs
add_id_input_data_dir = unicode_formatted_output_path
added_id_output_path = os.path.join(data_dir, "add_id")
add_ID_id_prefix = "VI_"
 
# Load the formatted dataset
dataset = DocumentDataset.read_parquet(add_id_input_data_dir)
 
# Initialize the AddId class with a specified prefix and start index
add_id = AddId(id_field='id', id_prefix=add_ID_id_prefix, start_index=0)
 
# Apply the ID addition to the dataset
id_dataset = add_id(dataset)
 
# Save the dataset with added IDs to disk
write_to_disk(id_dataset.df, output_file_dir=added_id_output_path, write_to_filename=True, output_type="parquet")
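The idea behind prefixed sequential IDs can be shown without any library. The following is a minimal sketch under stated assumptions: `make_doc_ids` is a hypothetical helper, and the underscore separator and 10-digit zero padding are illustrative choices that follow the `<prefix>_<id>` pattern mentioned above. The exact string that `AddId` produces may differ.

```python
def make_doc_ids(prefix, count, start_index=0, width=10):
    """Generate `count` sequential IDs of the form <prefix>_<zero-padded index>."""
    return [f"{prefix}_{i:0{width}d}" for i in range(start_index, start_index + count)]


docs = [{"text": "Xin chao"}, {"text": "Tam biet"}, {"text": "Cam on"}]
for doc, doc_id in zip(docs, make_doc_ids("VI", len(docs))):
    doc["id"] = doc_id

print([doc["id"] for doc in docs])
```

Zero padding keeps IDs the same length, so they sort lexicographically in the same order as the underlying indices.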

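Exact deduplication

Exact deduplication removes identical documents from the dataset. By eliminating exact duplicates, we make sure that each data point contributes uniquely to training, which increases the diversity and overall quality of the dataset.

This stage is GPU-accelerated and requires a GPU Dask cluster. The current cluster is CPU-based, so it must be shut down and a new cluster started with GPU support.

To close the existing cluster, use the following code: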

client.cluster.close()
client.shutdown()

Then initialize the GPU Dask cluster:

os.environ["DASK_DATAFRAME__QUERY_PLANNING"] = "False"
 
from nemo_curator.utils.distributed_utils import get_client
 
def pre_imports():
    import cudf 
 
client = get_client(cluster_type='gpu', set_torch_to_use_rmm=False)
client.run(pre_imports)

Exact deduplication is implemented as follows:

from nemo_curator.modules import ExactDuplicates
 
# Define input and output paths
exact_dedup_input_dataset_dir = added_id_output_path
exact_dedup_base_output_path = os.path.join(data_dir, "exact_dedup")
exact_dedup_log_dir = os.path.join(exact_dedup_base_output_path, "log")
exact_dedup_output_dir = os.path.join(exact_dedup_base_output_path, "data")
deduped_output_dir = os.path.join(data_dir,"remove_duplicate")
 
# Create directories for logs and output
os.makedirs(exact_dedup_log_dir, exist_ok=True)
os.makedirs(exact_dedup_output_dir, exist_ok=True)
os.makedirs(deduped_output_dir, exist_ok=True)
 
# Parameters for ExactDuplicates
exact_dedup_dataset_id_field = "id"
exact_dedup_dataset_text_field = "text"
 
# Load the input dataset
input_dataset = DocumentDataset.read_parquet(exact_dedup_input_dataset_dir, backend="cudf")
 
# Initialize and run exact deduplication
exact_dup = ExactDuplicates(
    logger=exact_dedup_log_dir,
    id_field=exact_dedup_dataset_id_field,
    text_field=exact_dedup_dataset_text_field,
    hash_method="md5",
    cache_dir=exact_dedup_output_dir
)
duplicates = exact_dup(dataset=input_dataset)
 
print(f"Number of exact duplicate files: {len(duplicates)}")
 
# Load the dataset and the exact-duplicate records to identify and remove duplicate IDs
input_dataset = DocumentDataset.read_parquet(added_id_output_path, backend="cudf")
exact_duplicates = DocumentDataset.read_parquet(
    os.path.join(exact_dedup_output_dir, "_exact_duplicates.parquet"), backend="cudf")
 
# Extract list of duplicate document IDs
exact_docs_to_remove = exact_duplicates.df.map_partitions(
    lambda x: x[x._hashes.duplicated(keep="first")]
)
 
# Remove duplicated documents from the input dataset
result = input_dataset.df[
~input_dataset.df[exact_dedup_dataset_id_field].isin(exact_docs_to_remove[exact_dedup_dataset_id_field].compute())
]
 
# Save the final deduplicated dataset
write_to_disk(result, output_file_dir=deduped_output_dir, write_to_filename=True, output_type="parquet")
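The logic of hash-based exact deduplication is compact enough to show on a handful of in-memory documents. The sketch below is a CPU-only illustration using the standard library's `hashlib`, not the GPU implementation in `ExactDuplicates`. The `exact_dedup` function and the sample documents are hypothetical. It keeps the first document seen for each MD5 digest and reports the IDs of the later copies.

```python
import hashlib


def exact_dedup(docs, id_field="id", text_field="text"):
    """Keep the first document for each distinct text; return (kept, removed_ids)."""
    seen_hashes = set()
    kept, removed_ids = [], []
    for doc in docs:
        digest = hashlib.md5(doc[text_field].encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            removed_ids.append(doc[id_field])
        else:
            seen_hashes.add(digest)
            kept.append(doc)
    return kept, removed_ids


docs = [
    {"id": "VI_0", "text": "Ha Noi la thu do cua Viet Nam."},
    {"id": "VI_1", "text": "Pho la mon an noi tieng."},
    {"id": "VI_2", "text": "Ha Noi la thu do cua Viet Nam."},  # exact copy of VI_0
]
kept, removed = exact_dedup(docs)
print([d["id"] for d in kept], removed)
```

Comparing fixed-size digests instead of full texts is what makes the approach cheap at scale. It only catches byte-identical documents, which is why near-duplicates need a separate fuzzy deduplication step.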

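Heuristic quality filtering

Heuristic quality filtering improves the quality of the dataset by removing low-quality content according to predefined heuristics. The approach applies a series of filters to the dataset to eliminate unwanted characteristics, such as too many special characters, text that is too short or too long, or other criteria that could negatively affect model performance.

We used a preconfigured YAML file to define the heuristic filters. The file lists the filter criteria and settings used to build the filter pipeline. You can customize the filters or change the thresholds as needed. The build_filter_pipeline helper reads the YAML settings and builds a pipeline that applies each filter to the dataset in sequence.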

from nemo_curator.utils.config_utils import build_filter_pipeline
import warnings
 
# Define paths for input data and output data after heuristic filtering
HF_input_data_dir = deduped_output_dir
HF_output_path = os.path.join(data_dir, "heuristic_filtering")
 
# Create a directory for the configuration file if it doesn't exist
os.makedirs("config", exist_ok=True)
# Download the YAML configuration file for heuristic filtering
!wget https://raw.githubusercontent.com/NVIDIA/NeMo-Curator/main/config/heuristic_filter_non-en.yaml -O ./config/heuristic_filter_non-en.yaml
 
# Specify the path to the configuration file
filter_config_file = "./config/heuristic_filter_non-en.yaml"
os.makedirs(HF_output_path, exist_ok=True)
 
# Load the filters from the YAML configuration file
filter_pipeline = build_filter_pipeline(filter_config_file)
 
# Load the dataset
dataset = DocumentDataset.read_parquet(HF_input_data_dir, backend="pandas")
 
# Suppress specific warnings during filtering
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=UserWarning)
    # Apply the heuristic filters to the dataset
    result_data = filter_pipeline(dataset)
  
    # Save the filtered dataset to disk
    result_data.to_parquet(HF_output_path, write_to_filename=True)
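Conceptually, a heuristic filter pipeline is a list of yes/no checks that every document must pass. The sketch below shows that structure in plain Python. It is a simplified illustration, not the filter set defined in the YAML file: the two checks, the helper names, and all thresholds are hypothetical.

```python
def symbol_ratio(text):
    """Fraction of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 0.0
    symbols = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return symbols / len(text)


def make_word_count_filter(min_words, max_words):
    """Build a check that keeps texts whose word count is within the range."""
    def keep(text):
        return min_words <= len(text.split()) <= max_words
    return keep


def make_symbol_filter(max_ratio):
    """Build a check that keeps texts whose symbol ratio is at most max_ratio."""
    def keep(text):
        return symbol_ratio(text) <= max_ratio
    return keep


def apply_filters(docs, filters):
    """Keep only documents that pass every filter, checked in order."""
    return [doc for doc in docs if all(f(doc["text"]) for f in filters)]


# Illustrative thresholds only; real pipelines use much larger minimum lengths.
filters = [make_word_count_filter(3, 1000), make_symbol_filter(0.3)]
docs = [
    {"id": "a", "text": "Hà Nội là thủ đô của Việt Nam."},
    {"id": "b", "text": "$$$ ### !!! ??? *** @@@"},
    {"id": "c", "text": "Quá ngắn"},
]
kept = apply_filters(docs, filters)
print([doc["id"] for doc in kept])
```

Because each check is independent, thresholds can be tuned one at a time while inspecting how many documents each filter removes.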

Token count distribution

Now examine how heuristic filtering changes the dataset. Before filtering, the dataset contained a wide range of text lengths, with some documents as short as a few tokens and others extending beyond 16K tokens. After filtering, the distribution of text lengths and token counts is more consistent. The filtering effectively removed extremely short documents (for example, those with fewer than 64 tokens) and reduced the number of overly long documents that likely contain redundant or irrelevant content.

*Figure 4. Comparison of sample length distributions, measured as the log2 of the token count per sample, in the deduplicated and heuristic-filtered datasets.* The histogram highlights the removal of extremely short and overly long documents. In the deduplicated dataset, the highest frequency occurs at log2 = 10 with 29.6 million samples, and heuristic filtering reduces this to 22.3 million. For log2 = 8, the frequency drops from 19.7 million in the deduplicated dataset to 12.8 million after filtering. At the lower end, the frequency for log2 = 4 is 0.6 million in the deduplicated dataset and is eliminated in the heuristic-filtered dataset.
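A length histogram of this kind can be computed by bucketing each document by the base-2 logarithm of its token count. The sketch below is an approximation under stated assumptions: it counts whitespace-separated words instead of using the model tokenizer, and it uses the floor of log2 as the bucket, which may not match the exact binning behind Figure 4. `log2_length_histogram` is a hypothetical helper.

```python
from collections import Counter


def log2_length_histogram(texts):
    """Count documents per floor(log2(token_count)) bucket.

    Tokens are approximated by whitespace-separated words; empty texts are skipped.
    """
    histogram = Counter()
    for text in texts:
        n_tokens = len(text.split())
        if n_tokens > 0:
            # For a positive integer n, n.bit_length() - 1 equals floor(log2(n)).
            histogram[n_tokens.bit_length() - 1] += 1
    return dict(sorted(histogram.items()))


texts = ["a", "a b c", "a b c d", " ".join(["tok"] * 64), ""]
print(log2_length_histogram(texts))

# Dropping documents shorter than 64 tokens removes every bucket below 6.
long_enough = [t for t in texts if len(t.split()) >= 64]
print(log2_length_histogram(long_enough))
```

Running the same function on the dataset before and after filtering gives two histograms that can be plotted side by side.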

Character-based metrics

Figure 5 shows a comparative analysis of each metric for the dataset before and after heuristic filtering.

*Figure 5. Box plots of symbol, number, and whitespace percentages before and after heuristic filtering.* The plots show significant noise reduction. In the raw data, the maximum symbol percentage reaches 99.44%, and heuristic filtering reduces it to 81.61%. The number percentage drops from a maximum of 91.53% in the raw data to 15.40% after filtering. For whitespace, the maximum decreases from 76.06% to 25.88%, indicating more consistent text formatting after filtering.

The box plots highlight a significant reduction in outliers after heuristic filtering. For symbols, the 95th percentile dropped from 8.84% to 5.47%, and for numbers, it dropped from 11.19% to 6.14%. The maximum whitespace percentage also dropped from 76.06% to 25.88%, while its 95th percentile remained stable. These reductions indicate that heuristic filtering effectively targets and removes noisy data with a high proportion of symbols, numbers, or whitespace, improving overall dataset quality.
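Metrics like these can be computed per document and then summarized with a percentile. The sketch below shows one way to do that in plain Python. The definitions are assumptions for illustration: a "symbol" is taken to be any character that is neither alphanumeric nor whitespace, and the percentile uses the nearest-rank method. The analysis behind Figure 5 may define both differently, and `char_percentages` and `percentile` are hypothetical helpers.

```python
def char_percentages(text):
    """Return symbol, number, and whitespace shares of `text`, in percent."""
    total = len(text)
    if total == 0:
        return {"symbol": 0.0, "number": 0.0, "whitespace": 0.0}
    symbols = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    numbers = sum(1 for ch in text if ch.isdigit())
    whitespace = sum(1 for ch in text if ch.isspace())
    return {
        "symbol": 100 * symbols / total,
        "number": 100 * numbers / total,
        "whitespace": 100 * whitespace / total,
    }


def percentile(values, q):
    """Nearest-rank percentile for an integer q between 1 and 100."""
    ordered = sorted(values)
    rank = (q * len(ordered) + 99) // 100  # integer ceiling of q * n / 100
    return ordered[max(rank, 1) - 1]


docs = ["ab 12 !!", "Gia ve la 250000 dong.", "### 404 ###"]
symbol_shares = [char_percentages(d)["symbol"] for d in docs]
print(char_percentages(docs[0]))
print(percentile(symbol_shares, 95))
```

Comparing the 95th percentile before and after filtering, as in the paragraph above, is more robust than comparing maxima, because a single extreme document does not dominate it.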

*Figure 6. Box plots of word count (left) and mean word length (right) before and after heuristic filtering.* The plots show how extreme outliers are reduced. In the raw dataset, the maximum word count is 163,822, and the heuristic-filtered dataset has a reduced maximum of 100,948. The median word count rises from 459 in the raw dataset to 574 after filtering, indicating a focus on more substantial text samples. For mean word length, the maximum decreases from 9.24 to 5.29, reflecting more consistent text quality after filtering.

Filtering removed extremely long documents, yet the overall distribution of word counts across documents remains similar. This suggests that the documents removed were abnormally long ones containing malformed tokens.

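Classifier-based quality filtering

Heuristic filtering removes low-quality content using simple rules, but it cannot capture more complex quality patterns. Classifier-based filtering uses a trained classifier model to categorize content as high or low quality, providing a smarter and more flexible way to handle diverse datasets that simple rules may miss.

Preparing data for training the classifier

Training a quality classifier requires representative samples of both high-quality and low-quality content. For high-quality data, we used articles from the Vietnamese Wikipedia, which are generally well structured and reliable. Low-quality samples were drawn from the unfiltered, crawled Vietnamese news corpus.

The data is prepared as follows: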

# Paths for high-quality and low-quality sample data
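hq_samples_path = os.path.join(data_dir, "classifier_filtering/train_samples/hq")
lq_samples_path = os.path.join(data_dir, "classifier_filtering/train_samples/lq")
# Make sure both sample directories exist before writing shards
os.makedirs(hq_samples_path, exist_ok=True)
os.makedirs(lq_samples_path, exist_ok=True)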
 
# Load and shard the high-quality dataset
ds = load_hf_dataset("wikimedia/wikipedia", "20231101.vi")
num_shards = 8
for shard_idx in range(num_shards):
    shard = ds["train"].shard(index=shard_idx, num_shards=num_shards)
    shard.to_parquet(os.path.join(hq_samples_path, f"{shard_idx}.parquet"))
 
# Load and shard the low-quality dataset
ds = load_hf_dataset("vietgpt/binhvq_news_vi",split="train[:100000]")
num_shards = 32
for shard_idx in range(num_shards):
    shard = ds.shard(index=shard_idx, num_shards=num_shards)
    shard.to_parquet(os.path.join(lq_samples_path, f"{shard_idx}.parquet"))

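Training the classifier

The classifier is trained with FastText, which offers an efficient approach to text classification. The following code trains the classifier with samples labeled as high and low quality: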

from nemo_curator.modifiers import FastTextLabelModifier
import fasttext
import random
 
# Function to create labeled samples
def create_samples(data_path, label, num_samples):
    raw_dataset = DocumentDataset.read_parquet(data_path, backend='pandas')
    label_quality = Modify(FastTextLabelModifier(label))
    labeled_dataset = label_quality(raw_dataset)
    labeled_samples = labeled_dataset.df.sample(frac=num_samples / len(labeled_dataset.df))
    return labeled_samples["text"].compute().values.tolist()
 
# Prepare training data
low_quality_samples = create_samples(lq_samples_path, "__label__lq", 100000)
high_quality_samples = create_samples(hq_samples_path, "__label__hq", 100000)
train_samples = low_quality_samples + high_quality_samples
random.shuffle(train_samples)
 
# Save training data to a file
train_file = "./cf_model_fasttext.train"
with open(train_file, "w", encoding="utf-8") as f:
    for sample in train_samples:
        f.write(sample + "\n")
 
# Train the FastText classifier
model = fasttext.train_supervised(input=train_file, lr=0.01, dim=100, epoch=5, wordNgrams=2)
model_path = "./cf_model_fasttext_model.bin"
model.save_model(model_path)
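The training file that fastText expects is plain text with one example per line, where each line starts with a label carrying the `__label__` prefix. The sketch below shows that line format and a simple held-out split for checking the classifier later. It assumes nothing about how `FastTextLabelModifier` is implemented: `to_fasttext_line` and `split_train_valid` are hypothetical helpers, and the sample sentences are made up.

```python
import random


def to_fasttext_line(text, label):
    """Format one training example as '<label> <text>' on a single line."""
    # Collapse newlines and repeated whitespace so each example stays on one line.
    cleaned = " ".join(text.split())
    return f"{label} {cleaned}"


def split_train_valid(lines, valid_fraction=0.1, seed=42):
    """Shuffle deterministically and hold out a fraction for validation."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


samples = [
    ("Hà Nội\nlà thủ đô của Việt Nam.", "__label__hq"),
    ("click here!!!   mua ngay", "__label__lq"),
]
lines = [to_fasttext_line(text, label) for text, label in samples]
print(lines[0])

train_lines, valid_lines = split_train_valid(lines * 5, valid_fraction=0.2)
print(len(train_lines), len(valid_lines))
```

A held-out file written this way can be passed to the trained model's `test` method to get precision and recall before the classifier is applied to the full dataset.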

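Classifying and filtering the dataset

Once trained, the classifier is used to filter the dataset, separating documents into high and low quality based on the distinctions it learned: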

from nemo_curator.filters import FastTextQualityFilter
from nemo_curator import ScoreFilter
 
# Define paths and load the dataset
CF_input_data_dir = HF_output_path
CF_output_path = os.path.join(data_dir, "classifier_filtering/output")
target_dataset = DocumentDataset.read_parquet(CF_input_data_dir, backend="pandas")
 
# Set up the filtering pipeline
filter_pipeline = ScoreFilter(FastTextQualityFilter(model_path), score_field="quality_score", score_type=float)
filtered_dataset = filter_pipeline(target_dataset)
 
# Save the filtered dataset
write_to_disk(filtered_dataset.df, output_file_dir=CF_output_path, write_to_filename=True, output_type="parquet")
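A quality score still has to be turned into a keep-or-drop decision. One option is a fixed threshold. Another, described in the GPT-3 paper, is a stochastic rule that keeps a document when a Pareto-distributed random draw exceeds one minus its score, so that a small share of lower-scoring documents survives. The sketch below illustrates that stochastic rule with the standard library. It is an assumption that a rule of this kind is relevant here; check the NeMo Curator source for what `FastTextQualityFilter` actually does. `keep_document` and the `alpha` value are illustrative.

```python
import random


def keep_document(score, alpha=3.0, rng=random):
    """Stochastically keep a document with quality score in [0, 1].

    random.paretovariate returns values >= 1, so subtracting 1 gives a
    draw that starts at 0, matching the form used with NumPy's pareto.
    """
    draw = rng.paretovariate(alpha) - 1.0
    return draw > 1.0 - score


rng = random.Random(0)
trials = 5000
high_rate = sum(keep_document(0.9, rng=rng) for _ in range(trials)) / trials
low_rate = sum(keep_document(0.1, rng=rng) for _ in range(trials)) / trials
print(f"keep rate at score 0.9: {high_rate:.2f}")
print(f"keep rate at score 0.1: {low_rate:.2f}")
```

Larger `alpha` values make the rule stricter, because large random draws become rarer and low-scoring documents are kept less often.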

Removing sensitive and sentimental data

Both the Adult and Sensitive Subjects domains, as well as positive and negative sentiment, were significantly reduced. This makes the model safer and more neutral, and better able to handle a variety of contexts and respond appropriately.

*Figure 7. Sample counts for sensitive domains (left) and sentiment (right) before and after classifier-based filtering.* In the heuristic-filtered dataset, 4542.31K samples (6.97%) belong to the Sensitive Subjects domain, which falls to 273K samples (2.02%) after classifier-based filtering. Adult content drops from 481.03K samples (0.74%) to 31K (0.23%). On the right, Positive samples fall from 4640K (7.12%) to 744K (5.63%), and Negative samples fall from 524K (0.80%) to 77K (0.59%).

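Preserving content diversity

Run the domain classifier model again and examine the diversity of the dataset content. The dataset now shows a balanced distribution across domains, with most domains accounting for between 3% and 8% of the data. From News and Law to specialized areas such as Games and Autos, this diversity ensures that the model can handle a wide range of topics. Even after filtering to improve quality and remove harmful content, the essential diversity is preserved, which is critical for building a general-purpose language model.

*Figure 8. Domain proportions in the final dataset, as identified by the domain classifier model.* The largest domain is Arts and Entertainment at 7.86%, followed by Health at 7.42% and People and Society at 7.07%. Other notable categories include Sports (6.66%), News (6.46%), and Food and Drink (5.7%). Smaller domains include Pets and Animals at 1.73% and Finance at 2.67%.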

Dataset size reduction after each stage

About 90% of the dataset was removed, consisting of documents that were low quality, noisy, or malformed. This selective filtering ensures that the training data is of the highest quality. The largest reduction came from classifier-based filtering (45.43%), indicating that a large amount of content was flagged as low quality or harmful and removed at this stage. Heuristic filtering accounted for 35.74% of the data removed, targeting issues such as sample length, repeated n-grams, and noise. Exact deduplication filtered out a smaller share of the data (8.31%).

*Figure 9. Breakdown of the amount of data filtered out at each curation stage for four different datasets.* The chart shows that about 90% of the data was removed to ensure high-quality training data. In the "All" dataset, 45.43% of the data was removed through classifier-based filtering, 35.74% through heuristic filtering, and 8.31% due to duplication. For the "Binhvq_News" dataset, classifier-based filtering accounted for the largest reduction at 48.04%, followed by heuristic filtering at 36.6%. The "Wiki_Vietnamese" dataset saw 49.59% of the data removed by classifier-based filtering and 37.31% by heuristic filtering, with 11.31% removed due to duplication.
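The per-stage figures for the combined dataset can be checked with a few lines of arithmetic. The sketch below assumes that each percentage is expressed relative to the original dataset, which is consistent with the three values adding up to roughly 90%. Under that assumption it also derives how much of the data entering each stage that stage removed. The variable names are illustrative.

```python
# Share of the ORIGINAL dataset removed at each stage, in pipeline order
# (assumption: all three figures use the original dataset as the base).
removed_pct = {
    "exact_dedup": 8.31,
    "heuristic": 35.74,
    "classifier": 45.43,
}

total_removed = sum(removed_pct.values())
remaining = 100.0 - total_removed
print(f"total removed: {total_removed:.2f}%")
print(f"remaining:     {remaining:.2f}%")

# Share of the data ENTERING each stage that the stage removed.
stage_relative = {}
entering = 100.0
for stage, pct in removed_pct.items():
    stage_relative[stage] = 100.0 * pct / entering
    entering -= pct

for stage, pct in stage_relative.items():
    print(f"{stage}: removes {pct:.1f}% of its input")
```

Viewed relative to its own input, the classifier stage is far more selective than its share of the original dataset suggests, because it operates on data that has already been reduced by the two earlier stages.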

Embedding visualization

The distribution of the final dataset is similar to that of the original dataset. Topic diversity is preserved, and most domains remain well represented. After this step, some smaller clusters appear slightly more defined, likely because low-quality or harmful content was removed.

Through heuristic and classifier-based filtering, the dataset retained broad domain diversity. Distinct clusters representing specific domains remain well defined, while more general and overlapping domains continue to show interconnection, ensuring that the dataset stays balanced and comprehensive for pretraining purposes.

*Figure 10. UMAP visualizations of a 5% sample of the raw dataset (left) and a 5% sample of the classifier-filtered dataset (right).* The visualizations show that the final dataset maintains domain diversity even after filtering, with well-distributed clusters for each domain. Both depict 26 domains, such as Arts and Entertainment, Health, and News, represented by distinct colors. Classifier-based filtering retains the main structure and diversity of the raw dataset, so domains remain well represented after filtering.

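Conclusion

This blog post presented the data curation pipeline that Viettel Solutions used for Vietnamese text data, along with an analysis of how each stage of the curation process affects the dataset. The pipeline uses NVIDIA NeMo Curator, a valuable tool for preparing large datasets for pretraining language models with a focus on quality, efficiency, and scalability. It offers a number of notable advantages in the data curation process, including:

Improving dataset quality by using heuristic and classifier-based filters to eliminate noise and harmful content.
Preserving the basic structure of the dataset, so that core characteristics remain intact after curation.
Adapting to different datasets, providing a tailored approach for each corpus.

For the complete code used in this post, see the Jupyter Notebook. Check out the NeMo Curator example scripts for other techniques, such as fuzzy deduplication and PII redaction.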
