利用 NVIDIA NeMo Curator 整理用于 LLM 參數高效微調的自定義數據集

在最近的一篇博文中，我們討論了如何使用 NVIDIA NeMo Curator 整理自定義數據集，用于大型語言模型（LLMs）和小型語言模型（SLMs）的預訓練或連續訓練用例。

雖然此類訓練場景是 LLM 開發的重要組成部分，但許多下游應用都涉及在特定領域的數據集上微調現有基礎模型。這可以使用監督式微調 (SFT) 或參數高效微調 (PEFT) 方法 (如 LoRA 和 p-tuning) 來實現。

在這些工作流程中，您通常需要快速迭代并嘗試各種想法和超參數設置，以及如何處理訓練數據并將其公開給模型。您必須處理和整理數據集的多個變體，以確保對特定領域數據的細微差別進行有效學習。

由于在這種工作流程中可用的數據量有限，因此使用靈活的處理流程對高質量數據進行細化至關重要。

本文將指導您使用 NeMo Curator 創建自定義數據管護工作流，并著重介紹 SFT 和 PEFT 用例。有關 NeMo Curator 提供的基本構建塊的更多信息，請參閱使用 NVIDIA NeMo Curator 為 LLM 訓練管理自定義數據集。

概述

出于演示目的，本文重點介紹一個涉及電子郵件分類的玩具示例。目標是整理一個基于文本的小型數據集，其中每個記錄都包含電子郵件（主題和正文）以及該電子郵件的預定義分類標簽。

為此，我們使用了 Enron 電子郵件數據集，將每封電子郵件標記為八個類別之一。此數據集可在 Hugging Face 上公開獲取，并包含約 1400 條記錄。

數據管護流程涉及以下高級步驟：

定義下載器、迭代器和提取器類，將數據集轉換為 JSONL 格式。
使用現有工具統一 Unicode 表示。
定義自定義數據集過濾器，以刪除空白或過長的電子郵件。
編輯數據集中的所有個人識別信息 (PII)。
為每條記錄添加指令提示。
整合整個管線。

在消費級硬件上執行此策展制作流程需要不到 5 分鐘的時間。要訪問本教程的完整代碼，請參閱 NVIDIA/NeMo-Curator GitHub 資源庫。

預備知識

開始之前，您必須安裝 NeMo Curator 框架。按照 NeMo Curator GitHub README 文件中的說明安裝該框架。

接下來，運行以下命令以驗證安裝并安裝任何其他依賴項：

$ python -c "import nemo_curator; print(nemo_curator);"
$ pip3 install requests

定義自定義文檔構建器

整理數據集的第一步是實現文檔構建器，以便下載并迭代數據集。

下載數據集

實現DocumentDownloader類獲取數據集的 URL，并使用requests庫。

import requests
from nemo_curator.download.doc_builder import DocumentDownloader
 
class EmailsDownloader(DocumentDownloader):
    def __init__(self, download_dir: str):
        super().__init__()
 
        if not os.path.isdir(download_dir):
            os.makedirs(download_dir)
 
        self._download_dir = download_dir
        print("Download directory: ", self._download_dir)
 
    def download(self, url: str) -> str:
        filename = os.path.basename(url)
        output_file = os.path.join(self._download_dir, filename)
 
        if os.path.exists(output_file):
            print(f"File '{output_file}' already exists, skipping download.")
            return output_file
 
        print(f"Downloading Enron emails dataset from '{url}'...")
        response = requests.get(url)
 
        with open(output_file, "wb") as file:
            file.write(response.content)
 
        return output_file

下載的數據集是一個文本文件，每個條目大致遵循以下格式：

“<s>[system instruction prompts]
 
Subject:: [email subject]
Body:: [email body]
 
[category label] <s>”

您可以使用正則表達式輕松地將這種格式分解為其組成部分。要記住的關鍵是，條目由“<s> … <s>”并且始終以指令提示開始。此外，示例分隔符令牌和系統提示令牌與 Llama 2 標記器系列兼容。

由于您可能會將這些數據與不支持特殊令牌的其他分詞器或模型一起使用，因此最好在解析期間丟棄這些指令和令牌。在本文的稍后部分中，我們將展示如何使用 NeMo Curator 將指令提示或特殊令牌添加到每個條目中DocumentModifier實用程序。

解析和迭代數據集

實現DocumentIterator和DocumentExtractor用于提取電子郵件主題、正文和類別 (類) 標簽的類：

from nemo_curator.download.doc_builder import (
    DocumentExtractor,
    DocumentIterator,
)
 
class EmailsIterator(DocumentIterator):
 
    def __init__(self):
        super().__init__()
        self._counter = -1
        self._extractor = EmailsExtractor()
        # The regular expression pattern to extract each email.
        self._pattern = re.compile(r"\"<s>.*?<s>\"", re.DOTALL)
 
    def iterate(self, file_path):
        self._counter = -1
        file_name = os.path.basename(file_path)
 
        with open(file_path, "r", encoding="utf-8") as file:
            lines = file.readlines()
 
        # Ignore the first line which contains the header.
        file_content = "".join(lines[1:])
        # Find all the emails in the file.
        it = self._pattern.finditer(file_content)
 
        for email in it:
            self._counter += 1
            content = email.group().strip('"').strip()
            meta = {
                "filename": file_name,
                "id": f"email-{self._counter}",
            }
            extracted_content = self._extractor.extract(content)
 
            # Skip if no content extracted
            if not extracted_content:
                continue
 
            record = {**meta, **extracted_content}
            yield record
 
 
class EmailsExtractor(DocumentExtractor):
    def __init__(self):
        super().__init__()
        # The regular expression pattern to extract subject/body/label into groups.
        self._pattern = re.compile(
            r"Subject:: (.*?)\nBody:: (.*?)\n.*\[/INST\] (.*?) <s>", re.DOTALL
        )
 
    def extract(self, content: str) -> Dict[str, str]:
        matches = self._pattern.findall(content)
 
        if not matches:
            return None
 
        matches = matches[0]
 
        return {
            "subject": matches[0].strip(),
            "body": matches[1].strip(),
            "category": matches[2].strip(),
        }

迭代器使用正則表達式，\"<s>.*?<s>\"然后，它將字符串傳遞給提取器，提取器使用正則表達式"Subject:: (.*?)\nBody:: (.*?)\n.*\[/INST\] (.*?) <s>"此表達式使用分組運算符(.*?)提取主題、正文和類別。

這些提取的部分以及有用的元數據（例如每封電子郵件的唯一 ID）存儲在字典中，并返回給調用者。

現在，您可以將此數據集轉換為 JSONL 格式，這是 NeMo Curator 支持的多種格式之一

將數據集寫入 JSONL 格式

數據集以純文本文件的形式下載。DocumentIterator和DocumentExtractor用于迭代記錄的類，將其轉換為 JSONL 格式，并將每條記錄作為一行存儲在文件中。

import json
 
def download_and_convert_to_jsonl() -> str:
    """
    Downloads the emails dataset and converts it to JSONL format.
 
    Returns:
        str: The path to the JSONL file.
    """
 
    # Download the dataset in raw format and convert it to JSONL.
    downloader = EmailsDownloader(DATA_DIR)
    output_path = os.path.join(DATA_DIR, "emails.jsonl")
    raw_fp = downloader.download(DATASET_URL)
 
    iterator = EmailsIterator()
 
    # Parse the raw data and write it to a JSONL file.
    with open(output_path, "w") as f:
        for record in iterator.iterate(raw_fp):
            json_record = json.dumps(record, ensure_ascii=False)
            f.write(json_record + "\n")
 
    return output_path

數據集中每條記錄的信息都寫入多個 JSON 字段：

subject
body
category
Metadata:
- id
- filename

這一點很有必要，因為 NeMo Curator 中的許多數據管護操作必須知道要在每個記錄中操作哪個字段。這一結構允許 NeMo Curator 操作輕松地定位不同的數據集信息。

使用文檔構建器加載數據集

在 NeMo Curator 中，數據集表示為類型對象DocumentDataset.這提供了從磁盤加載各種格式的數據集的輔助工具。使用以下代碼加載數據集并開始使用：

from nemo_curator.datasets import DocumentDataset
# define `filepath` to be the path to the JSONL file created above.
dataset = DocumentDataset.read_json(filepath, add_filename=True)

您現在擁有了定義自定義數據集策管線和準備數據所需的一切。

使用現有工具統一 Unicode 格式

通常最好修復數據集中的所有 Unicode 問題，因為從在線來源抓取的文本可能包含不一致或 Unicode 錯誤。

為了修改文檔，NeMo Curator 提供了一個DocumentModifier界面以及Modify輔助程序，用于定義如何修改每個文檔中的給定文本。有關實現您自己的自定義文檔修改器的更多信息，請參閱文本清理和統一在上一篇文章中看到的部分內容。

在本示例中，應用UnicodeReformatter到數據集。由于每條記錄都有多個字段，因此請對數據集中的每個相關字段應用一次操作。這些操作可以通過Sequential類：

Sequential([
    Modify(UnicodeReformatter(), text_field="subject"),
    Modify(UnicodeReformatter(), text_field="body"),
    Modify(UnicodeReformatter(), text_field="category"),
])

設計自定義數據集過濾器

在許多 PEFT 用例中，優化數據集涉及過濾掉可能無關緊要或質量較低的記錄，或者那些具有特定不合適屬性的記錄。在電子郵件數據集中，有些電子郵件過長或為空。出于演示目的，通過實現自定義，從數據集中刪除所有此類記錄DocumentFilter類：

from nemo_curator.filters import DocumentFilter
 
class FilterEmailsWithLongBody(DocumentFilter):
    """
    If the email is too long, discard.
    """
 
    def __init__(self, max_length: int = 5000):
        super().__init__()
        self.max_length = max_length
 
    def score_document(self, text: str) -> bool:
        return len(text) <= self.max_length
 
    def keep_document(self, score) -> bool:
        return score
 
class FilterEmptyEmails(DocumentFilter):
    """
    Detects empty emails (either empty body, or labeled as empty).
    """
 
    def score_document(self, text: str) -> bool:
        return (
            not isinstance(text, str)  # The text is not a string
            or len(text.strip()) == 0  # The text is empty
            or "Empty message" in text  # The email is labeled as empty
        )
 
    def keep_document(self, score) -> bool:
        return score

我們FilterEmailsWithLongBodyclass 會計算所提供文本中的字符數，并返回True如果長度是可以接受的，或False否則。您必須在body每個記錄的字段。

我們FilterEmptyEmails類檢查給定文本的類型和內容，以確定其是否為空電子郵件，并返回True如果電子郵件被視為空白，或者False否則。您必須在所有相關字段中明確應用此過濾器：subject, body以及category每條記錄的字段。

返回值與類的命名一致，可提高代碼的可讀性。但是，由于目標是丟棄空電子郵件，因此必須反轉此過濾器的結果。換言之，如果過濾器返回，則丟棄記錄True并在過濾器返回時保留記錄False.這可以通過提供相關標志來完成ScoreFilter輔助程序：

Sequential([
    # Apply only to the `body` field.
    ScoreFilter(FilterEmailsWithLongBody(), text_field="body", score_type=bool),
    # Apply to all fields, also invert the action.
    ScoreFilter(FilterEmptyEmails(), text_field="subject", score_type=bool, invert=True),
    ScoreFilter(FilterEmptyEmails(), text_field="body", score_type=bool, invert=True),
    ScoreFilter(FilterEmptyEmails(), text_field="category", score_type=bool, invert=True),
])

指定標志invert=True來指示ScoreFilter丟棄過濾器返回的文檔True.通過指定 score_type=bool為每個過濾器明確指定返回類型，以避免在執行期間進行類型推理。

編輯所有個人識別信息

接下來，定義處理步驟，以編輯每個記錄主題和正文中的所有個人識別信息 (PII)。此數據集包含許多 PII 實例，例如電子郵件、電話或傳真號碼、姓名和地址。

借助 NeMo Curator，您可以輕松指定要檢測的個人身份信息（PII）類型以及對每次檢測采取的操作。使用特殊令牌替換每個檢測：

def redact_pii(dataset: DocumentDataset, text_field) -> DocumentDataset:
    redactor = Modify(
        PiiModifier(
            supported_entities=[
                "ADDRESS",
                "EMAIL_ADDRESS",
                "LOCATION",
                "PERSON",
                "URL",
                "PHONE_NUMBER",
            ],
            anonymize_action="replace",
            device="cpu",
        ),
        text_field=text_field,
    )
    return redactor(dataset)

您可以將這些運算應用到subject和body使用 Pythonfunctools.partial輔助程序：

from functools import partial
 
redact_pii_subject = partial(redact_pii, text_field="subject")
redact_pii_body = partial(redact_pii, text_field="body")
 
Sequential([
    redact_pii_subject,
    redact_pii_body,
    ]
)

添加指令提示

數據管護流程的最后一步是向每條記錄添加指令提示，并確保每個類別的值都以句點終止。通過實現相關的DocumentModifier類：

from nemo_curator.modifiers import DocumentModifier
 
class AddSystemPrompt(DocumentModifier):
    def modify_document(self, text: str) -> str:
        return SYS_PROMPT_TEMPLATE % text
 
 
class AddPeriod(DocumentModifier):
    def modify_document(self, text: str) -> str:
        return text + "."

在代碼示例中，SYS_PROMPT_TEMPLATE變量包含一個格式字符串，可用于在文本周圍添加指令提示。這些修改器可以鏈接在一起：

Sequential([
    Modify(AddSystemPrompt(), text_field="body"),
    Modify(AddPeriod(), text_field="category"),
])

整合管線

在實現管線的每個步驟后，是時候將所有內容放在一起并按順序對數據集應用每個操作了。您可以使用Sequential將類到鏈式管理操作結合在一起：

curation_steps = Sequential(
    [
        #
        # Unify the text encoding to Unicode.
        #
        Modify(UnicodeReformatter(), text_field="subject"),
        Modify(UnicodeReformatter(), text_field="body"),
        Modify(UnicodeReformatter(), text_field="category"),
 
        #
        # Filtering
        #
        ScoreFilter(
            FilterEmptyEmails(), text_field="subject", score_type=bool, invert=True
        ),
        ScoreFilter(
            FilterEmptyEmails(), text_field="body", score_type=bool, invert=True
        ),
        ScoreFilter(
            FilterEmptyEmails(), text_field="category", score_type=bool, invert=True
        ),
        ScoreFilter(FilterEmailsWithLongBody(), text_field="body", score_type=bool),
 
        #
        # Redact personally identifiable information (PII).
        #
 
        redact_pii_subject,
        redact_pii_body,
 
        #
        # Final modifications.
        #
        Modify(AddSystemPrompt(), text_field="body"),
        Modify(AddPeriod(), text_field="category"),
    ]
)
 
dataset = curation_steps(dataset)
dataset = dataset.persist()
dataset.to_json("/output/path", write_to_filename=True)

NeMo Curator 使用 Dask 以分布式方式處理數據集。由于 Dask 操作是延遲評估的，因此您必須調用.persist用于指示 Dask 應用操作的函數。處理完成后，您可以通過調用.to_json并提供輸出路徑。

后續步驟

本教程演示了如何使用 NeMo Curator 創建自定義數據策劃流程，特別關注 SFT 和 PEFT 用例。

為了便于訪問，我們將教程上傳到了NVIDIA NeMo-Curator GitHub 資源庫。為資源庫添加星號，以便及時了解最新開發成果，并接收有關新功能、bug 修復和更新的通知。

現在您已經整理好數據，可以微調 LLM，例如使用 LoRA 進行電子郵件分類的 Llama 2 模型。有關更多信息，請參閱使用 Llama 2 編寫的 NeMo 框架 PEFT 手冊。

您還可以請求訪問 NVIDIA NeMo Curator 微服務，該服務為企業提供了從任何地方開始數據采集的最簡單途徑。如需申請，請參閱 NeMo Curator Microservice Early Access。

利用 NVIDIA NeMo Curator 整理用于 LLM 參數高效微調的自定義數據集

概述

預備知識

定義自定義文檔構建器

下載數據集

解析和迭代數據集

將數據集寫入 JSONL 格式

使用文檔構建器加載數據集

使用現有工具統一 Unicode 格式

設計自定義數據集過濾器

編輯所有個人識別信息

添加指令提示

整合管線

后續步驟

相關資源

標簽

關于作者

利用 NVIDIA NeMo Curator 整理用于 LLM 參數高效微調的自定義數據集

概述

預備知識

定義自定義文檔構建器

下載數據集

解析和迭代數據集

將數據集寫入 JSONL 格式

使用文檔構建器加載數據集

使用現有工具統一 Unicode 格式

設計自定義數據集過濾器

編輯所有個人識別信息

添加指令提示

整合管線

后續步驟

相關資源

標簽

關于作者

相關文章

借助 NVIDIA NeMo Curator 擴展和整理用于 LLM 訓練的高質量數據集

相關文章

如何使用 NVIDIA NeMo Agent 工具套件開源庫構建自定義 AI 智能體

適用于有效 FP8 訓練的按張量和按塊擴展策略

出色的多模態 RAG：Llama 3.2 NeMo 檢索器嵌入模型如何提高工作流準確性

在 NVIDIA Jetson 和 RTX 上運行 Google DeepMind 的 Gemma 3n

提高嵌入模型準確性，實現定制化信息檢索