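Customizing NVIDIA NIM for Domain-Specific Needs with NVIDIA NeMo

Large language models (LLMs) adopted for specific enterprise applications usually benefit from model customization. Enterprises need to tailor LLMs to their specific needs and quickly deploy these models for low-latency, high-throughput inference. This post helps you get started with that process.

Specifically, we show how to customize the Llama 3 8B NIM with the PubMedQA dataset to answer questions in the biomedical domain. Question answering is critical for organizations that need to quickly extract key information from large volumes of content and surface relevant information to their customers.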

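NVIDIA software used in this tutorial

NVIDIA NIM, part of NVIDIA AI Enterprise, is a set of easy-to-use inference microservices designed to accelerate the deployment of performance-optimized generative AI models in the enterprise. NIM inference microservices can be deployed anywhere, from workstations and on-premises systems to the cloud, giving enterprises control over their deployment choices and keeping their data secure. NIM also delivers industry-leading latency and throughput, enabling cost-effective scaling and a seamless experience for end users.

NIM inference microservices are now available for the Llama 3 8B Instruct and Llama 3 70B Instruct models, for self-hosted deployment on any NVIDIA-accelerated infrastructure. If you are just getting started with prototyping, check out the Llama 3 APIs in the NVIDIA API catalog.

NVIDIA NeMo is an end-to-end platform for developing custom generative AI. NeMo includes tools for training, customization, retrieval-augmented generation (RAG), guardrails, toolkits, data curation, and model pretraining. NeMo offers a simple, cost-effective, and fast way to adopt generative AI.

With the NeMo framework, enterprises can build models that align with their brand voice and understand domain-specific knowledge. Whether creating a customer service chatbot or an IT help bot, NeMo helps developers build custom generative AI that excels at its task while incorporating industry terminology, domain knowledge and skills, and unique organizational requirements.

Figure 1 shows the general steps involved in customizing an LLM NIM with NeMo and LoRA and deploying it with NIM. First, convert the model to .nemo format. Then, create LoRA adapters for the NeMo model and use those adapters with NIM for inference on the customized model. NIM supports dynamic loading of LoRA adapters, which makes it possible to train multiple LoRA models for different use cases.

*Figure 1. The steps involved in customizing an LLM NIM with the NeMo framework and LoRA and deploying it with NIM: converting the model to .nemo format, creating LoRA adapters with the NeMo framework, and using the adapters with NIM for inference on the customized model*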

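Prerequisites

Before you begin, make sure you have the following:

Access to NVIDIA A100, NVIDIA H100, or NVIDIA L40S GPUs. One or more GPUs with a cumulative memory of 80 GB or more is recommended.
A Docker-enabled environment with the NVIDIA Container Runtime installed, which makes the container GPU-aware.
An NGC CLI API key, which is provided when you authenticate with NVIDIA NGC and download the NGC CLI tool.
An NVIDIA AI Enterprise license. To request a free 90-day trial license, visit Llama 3 8B Instruct in the API catalog and click the Run Anywhere with NIM button.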
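
As a quick sanity check of the GPU memory recommendation above, the following minimal sketch sums the memory that nvidia-smi reports for each visible GPU. The function names and the 80,000 MiB threshold are assumptions made for illustration (nvidia-smi reports MiB, and the exact total of an "80 GB" card varies slightly by model), so treat the result as a rough guide rather than an official requirement check.

```python
import subprocess

# Rough threshold (assumption): "80 GB" approximated as 80,000 MiB, because
# nvidia-smi reports MiB and reported totals differ slightly between cards.
REQUIRED_MIB = 80_000


def parse_memory_mib(csv_text: str) -> list[int]:
    """Parse one memory.total value (in MiB) per non-empty line."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def query_gpu_memory_mib() -> list[int]:
    """Ask nvidia-smi for the total memory of each visible GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_mib(result.stdout)


def meets_recommendation(memory_mib: list[int], required_mib: int = REQUIRED_MIB) -> bool:
    """Return True when the cumulative GPU memory reaches the threshold."""
    return sum(memory_mib) >= required_mib
```

On a machine with the NVIDIA driver installed, calling meets_recommendation(query_gpu_memory_mib()) gives a yes/no answer before you start any downloads.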

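Step 1: Download the Llama 3 8B Instruct model

You can use the CLI to download the Llama 3 8B Instruct model from the NVIDIA NGC catalog. The model is already converted to .nemo format, which is compatible with the NeMo framework.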

ngc registry model download-version "nvidia/nemo/llama-3-8b-instruct-nemo:1.0"

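This creates a folder named llama-3-8b-instruct-nemo_v1.0 that contains the .nemo file.

Step 2: Get the NeMo framework container

The NeMo framework is available as a Docker container in the NGC catalog. The container includes the environment and all the scripts needed for LoRA fine-tuning.

The following command assumes that the Llama 3 8B Instruct model folder is inside the current working directory, so that it is mounted (at /workspace) and accessible to the fine-tuning scripts.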

# Run the Docker container in interactive mode.
# Each backslash must be the last character on its line for the continuation to work.
docker run \
     --gpus all \
     --shm-size=2g \
     --net=host \
     --ulimit memlock=-1 \
     --rm -it \
     -v ${PWD}:/workspace \
     -w /workspace \
     -v ${PWD}/results:/results \
     nvcr.io/nvidia/nemo:24.05 bash

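Once inside the container, you can carry out the remaining steps in a Jupyter Notebook environment.

Step 3: Download and preprocess the custom dataset

PubMedQA is a dataset for question answering in the medical domain. To download the dataset, clone the pubmedqa GitHub repository, which includes the steps for splitting the dataset into train/val/test sets.

A raw example is shown below: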

"18251357": { 
"QUESTION": "Does histologic chorioamnionitis correspond to clinical chorioamnionitis?", 
"CONTEXTS": [ "To evaluate the degree to which histologic chorioamnionitis, a frequent finding in placentas submitted for histopathologic evaluation, correlates with clinical indicators of infection in the mother.", "A retrospective review was performed on 52 cases with a histologic diagnosis of acute chorioamnionitis from 2,051 deliveries at University Hospital, Newark, from January 2003 to July 2003. Third-trimester placentas without histologic chorioamnionitis (n = 52) served as controls. Cases and controls were selected sequentially. Maternal medical records were reviewed for indicators of maternal infection.", "Histologic chorioamnionitis was significantly associated with the usage of antibiotics (p = 0.0095) and a higher mean white blood cell count (p = 0.018). The presence of 1 or more clinical indicators was significantly associated with the presence of histologic chorioamnionitis (p = 0.019)." ], 
"reasoning_required_pred": "yes", 
"reasoning_free_pred": "yes", 
"final_decision": "yes", 
"LONG_ANSWER": "Histologic chorioamnionitis is a reliable indicator of infection whether or not it is clinically apparent." },

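Given the question and context, the goal of this tutorial is to fine-tune Llama 3 8B to answer "yes" or "no".

For fine-tuning, convert the data to .jsonl format, where each line is a separate example in the form of a JSON dict with input and output keys for supervised learning. After preprocessing, an example looks like this: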

{
"input": "OBJECTIVE: To evaluate the degree to which histologic chorioamnionitis, a frequent finding in placentas submitted for histopathologic evaluation, correlates with clinical indicators of infection in the mother ... \nQUESTION: Does histologic chorioamnionitis correspond to clinical chorioamnionitis?\n ### ANSWER (yes|no|maybe): ", 
"output": "<<< yes >>>"}

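Note that the input includes both the context and the question.

Adding the "<<<" and ">>>" markers to the output makes it possible to verify the LoRA-tuned model, because the base model can also produce "Yes"/"No" responses from a zero-shot template.

For end-to-end preprocessing instructions, see the Jupyter Notebook tutorial.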
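
To make the conversion concrete, here is a minimal sketch of how a raw record could be turned into the input/output format shown earlier. It is not the notebook's code: the function names are hypothetical, and it assumes that section labels such as OBJECTIVE or METHODS, when available, sit under a LABELS key alongside CONTEXTS (the raw example above does not show that key, so the sketch falls back to unlabeled contexts).

```python
import json


def to_sft_example(record: dict) -> dict:
    """Convert one raw PubMedQA record into an input/output training pair."""
    contexts = record["CONTEXTS"]
    # Assumption: optional section labels live under "LABELS", one per context.
    labels = record.get("LABELS") or []
    if len(labels) != len(contexts):
        labels = [None] * len(contexts)
    parts = [f"{label}: {ctx}" if label else ctx for label, ctx in zip(labels, contexts)]
    prompt = "\n".join(parts)
    prompt += f"\nQUESTION: {record['QUESTION']}\n ### ANSWER (yes|no|maybe): "
    # Wrap the answer in the <<< >>> markers used throughout this tutorial.
    return {"input": prompt, "output": f"<<< {record['final_decision']} >>>"}


def write_jsonl(records: dict, path: str) -> None:
    """Write one JSON object per line from records keyed by ID."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records.values():
            f.write(json.dumps(to_sft_example(record)) + "\n")
```

Keeping the prompt template in one function means the exact same string layout can be reused when building prompts at inference time.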

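Step 4: Fine-tune the model with the NeMo framework

The NeMo framework includes a high-level Python script for fine-tuning, megatron_gpt_finetuning.py, which abstracts away some of the lower-level API calls.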

MODEL="/workspace/llama-3-8b-instruct-nemo_v1.0/8b_instruct_nemo_bf16.nemo"
TRAIN_DS="[./pubmedqa/data/pubmedqa_train.jsonl]"
VALID_DS="[./pubmedqa/data/pubmedqa_val.jsonl]"
TEST_DS="[./pubmedqa/data/pubmedqa_test.jsonl]"
TEST_NAMES="[pubmedqa]"
 
# Tensor and Pipeline model parallelism
TP_SIZE=1
PP_SIZE=1
 
# Save results and checkpoints in this directory
OUTPUT_DIR="./results/Meta-Llama-3-8B-Instruct"
 
torchrun --nproc_per_node=1 \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
exp_manager.exp_dir=${OUTPUT_DIR} \
exp_manager.explicit_log_dir=${OUTPUT_DIR} \
trainer.devices=1 \
trainer.num_nodes=1 \
trainer.precision=bf16-mixed \
trainer.val_check_interval=20 \
trainer.max_epochs=10 \
model.megatron_amp_O2=False \
++model.mcore_gpt=True \
model.tensor_model_parallel_size=${TP_SIZE} \
model.pipeline_model_parallel_size=${PP_SIZE} \
model.micro_batch_size=1 \
model.global_batch_size=8 \
model.restore_from_path=${MODEL} \
model.data.train_ds.num_workers=0 \
model.data.validation_ds.num_workers=0 \
model.data.train_ds.file_names=${TRAIN_DS} \
model.data.train_ds.concat_sampling_probabilities=[1.0] \
model.data.validation_ds.file_names=${VALID_DS} \
model.peft.peft_scheme="lora"

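This creates the LoRA adapter, in .nemo format, in $OUTPUT_DIR/checkpoints.

The model.peft.peft_scheme parameter determines the technique used. This tutorial uses LoRA, but the NeMo framework also supports other techniques, such as P-tuning, adapters, and IA3.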
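
To give a feel for why LoRA adapters are small, the sketch below counts the trainable parameters LoRA adds to a single weight matrix, using the standard low-rank formulation in which a d_out x d_in weight is adapted by two factors of rank r. The 4096 x 4096 matrix and the rank of 32 are illustrative assumptions, not values taken from the tutorial's configuration.

```python
def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters in the LoRA factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


# Illustrative values (assumptions): one square projection matrix and a
# hypothetical adapter rank. Check your own config for the real values.
d_model = 4096
rank = 32

full_params = d_model * d_model
adapter_params = lora_param_count(d_model, d_model, rank)
fraction = adapter_params / full_params

print(f"full matrix: {full_params:,} parameters")
print(f"LoRA factors: {adapter_params:,} parameters ({fraction:.2%} of the matrix)")
```

For these illustrative numbers, the adapter holds 262,144 parameters against 16,777,216 in the full matrix, which is why many adapters can share one base model.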

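Training the Llama 3 70B model involves the same process. The only differences are the greater memory and compute requirements and the need to shard the model across multiple GPUs. The recommended configuration is eight NVIDIA A100 or NVIDIA H100 80 GB GPUs with eight-way tensor parallelism (TP=8, PP=1).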
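
The batch-size and parallelism settings interact, so a small worked calculation can help when moving between the 8B and 70B setups. The sketch below uses the usual Megatron-style relationships (data-parallel size = GPUs / (TP x PP), and global batch = micro batch x data-parallel size x accumulation steps). The helper names are hypothetical, and this is a reasoning aid rather than code from the NeMo framework.

```python
def data_parallel_size(num_gpus: int, tp_size: int, pp_size: int) -> int:
    """Number of model replicas once each replica is split across TP x PP GPUs."""
    model_parallel = tp_size * pp_size
    if num_gpus % model_parallel != 0:
        raise ValueError("num_gpus must be divisible by tp_size * pp_size")
    return num_gpus // model_parallel


def grad_accumulation_steps(global_batch: int, micro_batch: int, dp_size: int) -> int:
    """Micro-batches each replica processes per optimizer step."""
    per_pass = micro_batch * dp_size
    if global_batch % per_pass != 0:
        raise ValueError("global_batch must be divisible by micro_batch * dp_size")
    return global_batch // per_pass


# 8B command above: one GPU, TP=1, PP=1, micro batch 1, global batch 8.
dp_8b = data_parallel_size(num_gpus=1, tp_size=1, pp_size=1)
steps_8b = grad_accumulation_steps(global_batch=8, micro_batch=1, dp_size=dp_8b)

# 70B recommendation: eight GPUs, TP=8, PP=1, so a single sharded replica.
dp_70b = data_parallel_size(num_gpus=8, tp_size=8, pp_size=1)

print(dp_8b, steps_8b, dp_70b)
```

In both setups there is a single model replica, so a global batch of 8 with a micro batch of 1 is reached through eight accumulated micro-batches.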

You can override many of these configurations when running the script. For the full set of possible configurations, see the config yaml.

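Step 5: Prepare the LoRA model store

Now that you have the .nemo LoRA model, it is time to deploy it. NIM can deploy multiple LoRA adapters on the same base model, and it expects a specific directory structure in order to find them.

The following example shows how to prepare this "model store". Each LoRA adapter should be placed in its own folder, and the folder name is used as the reference for sending requests to that adapter at inference time.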

</path/to/LoRA-model-store>
├── llama3-8b-pubmed-qa
│   └── megatron_gpt_peft_lora_tuning.nemo
├── llama3-8b-lora_model_2_nemo
│   └── llama3-8b-instruct-lora_model_2.nemo
└── llama3-8b-lora_model_3_hf
    ├── adapter_config.json
    └── adapter_model.safetensors

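In this tutorial, one LoRA adapter was trained on PubMedQA, so go ahead and place it in its own directory inside the model store folder. If you have other adapters, you can repeat this process with them for a multi-LoRA deployment. Note that NVIDIA NIM supports adapters trained with the NeMo framework as well as with Hugging Face PEFT.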
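
If you prefer to script this step, the following minimal sketch copies an adapter file into its own named folder inside the model store and lists the resulting adapter names. The helper names are hypothetical, and the paths you pass in are your own; only the one-folder-per-adapter layout comes from the structure shown above.

```python
import shutil
from pathlib import Path


def add_adapter(model_store: Path, adapter_name: str, adapter_file: Path) -> Path:
    """Copy one adapter file into <model_store>/<adapter_name>/ and return its new path."""
    target_dir = model_store / adapter_name
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(adapter_file, target_dir / adapter_file.name))


def list_adapters(model_store: Path) -> list[str]:
    """Folder names in the store, which double as model names at inference time."""
    return sorted(p.name for p in model_store.iterdir() if p.is_dir())
```

For example, add_adapter(Path("loras"), "llama3-8b-pubmed-qa", Path("results/Meta-Llama-3-8B-Instruct/checkpoints/megatron_gpt_peft_lora_tuning.nemo")) would reproduce the first entry of the tree above, assuming the checkpoint was written under that name and that loras is the folder you use as the model store.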

Step 6: Deploy with NIM

With the model store organized, deployment takes a single Docker command.

export NGC_API_KEY=<YOUR_NGC_API_KEY>
export LOCAL_PEFT_DIRECTORY=</path/to/LoRA-model-store>
chmod -R 777 $LOCAL_PEFT_DIRECTORY
 
export NIM_PEFT_SOURCE=/home/nvs/loras # Path to LoRA models internal to the container
export NIM_PEFT_REFRESH_INTERVAL=3600  # (in seconds) NIM_PEFT_SOURCE is checked for newly added models once per interval (here, every hour)
export CONTAINER_NAME=meta-llama3-8b-instruct
 
export NIM_CACHE_PATH=</path/to/NIM-model-store-cache>  # Model artifacts (in container) are cached in this directory
 
mkdir -p $NIM_CACHE_PATH
chmod -R 777 $NIM_CACHE_PATH
 
 
docker run -it --rm --name=$CONTAINER_NAME \
    --runtime=nvidia \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -e NIM_PEFT_SOURCE \
    -e NIM_PEFT_REFRESH_INTERVAL \
    -v $NIM_CACHE_PATH:/opt/nim/.cache \
    -v $LOCAL_PEFT_DIRECTORY:$NIM_PEFT_SOURCE \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama3-8b-instruct:1.0.0

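The first time you run the command, the NVIDIA TensorRT-LLM-optimized Llama 3 engine is downloaded and cached in $NIM_CACHE_PATH, which speeds up subsequent deployments. Several other options are available to further configure NIM; you can find the full list in the NIM configuration documentation.

Running this command should start a server on port 8000, and you are now ready to start sending inference requests.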
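
Before sending prompts, it can be useful to confirm that the server is up and that your adapter was picked up. The sketch below assumes the server exposes an OpenAI-style model list at /v1/models that returns a JSON object with a data array of entries carrying an id field; that endpoint and response shape are assumptions here, so check the NIM documentation for your version. The function names are hypothetical.

```python
import json
import urllib.request


def model_ids(models_response: dict) -> list[str]:
    """Extract model names from an OpenAI-style list: {"data": [{"id": ...}, ...]}."""
    return [entry["id"] for entry in models_response.get("data", [])]


def fetch_model_ids(base_url: str = "http://0.0.0.0:8000", timeout: float = 10.0) -> list[str]:
    """GET <base_url>/v1/models (assumed endpoint) and return the served model names."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return model_ids(json.load(resp))
```

With the container running, "llama3-8b-pubmed-qa" in fetch_model_ids() should be true once the adapter folder has been loaded from the model store.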

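Step 7: Send inference requests

To create a completion, send a POST request to the /completions endpoint. To follow along, create a Python script or launch a Jupyter Notebook in a separate terminal. The following code uses the Python requests library.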

import requests
import json
 
url = 'http://0.0.0.0:8000/v1/completions'
headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json'
}
 
# Example from the PubMedQA test set
prompt="BACKGROUND: Sublingual varices have earlier been related to ageing, smoking and cardiovascular disease. The aim of this study was to investigate whether sublingual varices are related to presence of hypertension.\nMETHODS: In an observational clinical study among 431 dental patients tongue status and blood pressure were documented. Digital photographs of the lateral borders of the tongue for grading of sublingual varices were taken, and blood pressure was measured. Those patients without previous diagnosis of hypertension and with a noted blood pressure \u2265 140 mmHg and/or \u2265 90 mmHg at the dental clinic performed complementary home blood pressure during one week. Those with an average home blood pressure \u2265 135 mmHg and/or \u2265 85 mmHg were referred to the primary health care centre, where three office blood pressure measurements were taken with one week intervals. Two independent blinded observers studied the photographs of the tongues. Each photograph was graded as none/few (grade 0) or medium/severe (grade 1) presence of sublingual varices. Pearson's Chi-square test, Student's t-test, and multiple regression analysis were applied. Power calculation stipulated a study population of 323 patients.\nRESULTS: An association between sublingual varices and hypertension was found (OR = 2.25, p<0.002). Mean systolic blood pressure was 123 and 132 mmHg in patients with grade 0 and grade 1 sublingual varices, respectively (p<0.0001, CI 95 %). Mean diastolic blood pressure was 80 and 83 mmHg in patients with grade 0 and grade 1 sublingual varices, respectively (p<0.005, CI 95 %). Sublingual varices indicate hypertension with a positive predictive value of 0.5 and a negative predictive value of 0.80.\nQUESTION: Is there a connection between sublingual varices and hypertension?\n ### ANSWER (yes|no|maybe): "
 
data = {
    "model": "llama3-8b-pubmed-qa",
    "prompt": prompt,
    "max_tokens": 128
}
 
response = requests.post(url, headers=headers, json=data)
response_data = response.json()
 
print(json.dumps(response_data, indent=4))

The output looks like this:

{
    "id": "cmpl-403d22baa7c3470eb468ee8a38033e1f",
    "object": "text_completion",
    "created": 1717493046,
    "model": "llama3-8b-pubmed-qa",
    "choices": [
        {
            "index": 0,
            "text": " <<< yes >>>",
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null
        }
    ],
    "usage": {
        "prompt_tokens": 412,
        "total_tokens": 415,
        "completion_tokens": 3
    }
}
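
To turn a response like this into a plain label, one option is to look for the "<<<" and ">>>" markers in the first choice's text. The sketch below is a minimal, hypothetical helper rather than code from the tutorial. It returns None when the markers are missing, which is a simple way to spot replies that did not follow the fine-tuned format.

```python
import re
from typing import Optional

# Matches the marker format used in the training data, e.g. "<<< yes >>>".
ANSWER_PATTERN = re.compile(r"<<<\s*(yes|no|maybe)\s*>>>", re.IGNORECASE)


def extract_answer(response_data: dict) -> Optional[str]:
    """Return 'yes', 'no', or 'maybe' from the first choice, or None if unmarked."""
    text = response_data["choices"][0]["text"]
    match = ANSWER_PATTERN.search(text)
    return match.group(1).lower() if match else None
```

Here, extract_answer applied to the response above yields "yes", while an unmarked reply such as "Yes, there is a connection" yields None.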

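This example returns the text output "<<< yes >>>" along with other metadata. As you may recall from a few steps earlier, this is the format the model was trained on. Running inference on the entire PubMedQA test set and computing accuracy yields the following metrics: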

Accuracy 0.786000
Macro-F1 0.584112
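
For reference, accuracy and macro-F1 can be computed from lists of gold and predicted labels as in the sketch below. This is a plain-Python illustration of the standard definitions, not the tutorial's evaluation script. It assumes the three labels yes, no, and maybe, and it gives a class with no true or predicted examples an F1 of 0.

```python
def accuracy(y_true: list, y_pred: list) -> float:
    """Fraction of predictions that exactly match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: list, y_pred: list, labels=("yes", "no", "maybe")) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data for illustration only (not PubMedQA results).
gold = ["yes", "no", "maybe", "yes"]
pred = ["yes", "no", "yes", "yes"]
print(accuracy(gold, pred), macro_f1(gold, pred))
```

Because macro-F1 weights every class equally, a class that is rarely predicted correctly can pull it well below accuracy, which is one possible reason for a gap like the one in the reported numbers.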

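Summary

Congratulations! You have successfully customized the Llama 3 8B Instruct model and deployed it with NVIDIA NIM. Compared with the PubMedQA leaderboard, you can obtain a reasonably accurate model with only a few training steps and a short training time. The full tutorial also includes instructions for computing these metrics. Hyperparameters can be tuned further for higher accuracy, and the advanced training optimizations in the NeMo framework enable faster iteration.

To further simplify generative AI customization, the NeMo team has announced an early access program for the NVIDIA NeMo Customizer microservice. This high-performance, scalable service simplifies fine-tuning and alignment of LLMs for domain-specific use cases. Built on a familiar microservice and API architecture, it helps enterprises bring solutions to market faster. Apply for early access.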
