Fine-Tuning Small Language Models to Improve Code Review Accuracy

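Generative AI is transforming enterprises by driving innovation and improving efficiency across numerous applications. However, adopting large foundation models introduces several challenges, including high cost, high latency, and data privacy concerns. Many enterprises are reluctant to share sensitive code or data with external LLM providers. In addition, while foundation LLMs excel at general tasks, they often need extensive prompt engineering to reach high accuracy on specific, enterprise-centric use cases.

Fine-tuned small language models (SLMs), often built with techniques such as knowledge distillation, offer a compelling answer to these challenges. These smaller LLMs deliver performance close to that of larger models while being faster and more cost-effective. SLMs can also be deployed on-premises or in virtual private clouds (VPCs), which lets enterprises keep sensitive data secure. However, fine-tuning smaller models requires high-quality labeled data, which is time-consuming and expensive to create.

This post presents an automated fine-tuning approach that addresses these challenges with a data flywheel strategy, a feedback-driven mechanism that iteratively improves model performance. The approach incorporates curriculum learning, a technique inspired by human learning that introduces training data progressively according to complexity. By using a large "teacher" model to generate and structure synthetic training data, the method optimizes the fine-tuning process so that smaller models can handle complex tasks more effectively with minimal human intervention.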

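This post covers the following topics:

Overview of the automated fine-tuning approach: A teacher-student paradigm for building efficient training workflows.
Implementation steps: The key stages, such as exam generation, evaluation, and fine-tuning.
Application to code review automation: Real-world examples such as severity rating and explanation generation, where an automatically fine-tuned SLM (Llama 3 8B Instruct with low-rank adaptation (LoRA), or llama3-8b+LORA) improved accuracy by 18%, outperformed larger models, and produced explanations that match expert standards, all at lower cost and latency.
Lessons learned: Best practices for scalable, cost-effective AI solutions.

By the end of this post, you will understand how fine-tuning SLMs helps enterprises achieve competitive accuracy while addressing challenges related to cost, latency, and scalability. Although the focus here is on code assistance, the approach applies to a wide range of enterprise use cases.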

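Overview of the automated fine-tuning approach

The automated fine-tuning approach draws inspiration from how teachers adapt their curriculum to address a student's specific areas for improvement. It uses a teacher-student paradigm that incorporates the principles of knowledge distillation.

*Figure 1. High-level overview of the developed automated fine-tuning architecture.* The teacher model generates an exam, the student model takes the exam, and the teacher evaluates the results. Based on the evaluation, the teacher generates a new curriculum for fine-tuning. The loop continues until the desired performance is achieved.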

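In this process, a large LLM (the teacher) organizes and prepares training data (the curriculum) for a small LLM, or SLM (the student), in five iterative steps:

1. Exam generation: The teacher LLM creates a test for the student SLM based on the student's prior performance, user feedback (the data flywheel), and previous exam results.

2. Taking the exam: The student takes the test generated by the teacher.

3. Evaluation: The teacher evaluates the student's performance, highlighting the student's strengths and areas for improvement.

4. Curriculum generation: The teacher tailors the training data to the evaluation results, targeting the specific weaknesses it identified.

5. Fine-tuning: The student is fine-tuned on the updated dataset using a technique such as LoRA. Unlike traditional fine-tuning, which adjusts all model parameters and requires substantial compute, LoRA optimizes a much smaller set of task-specific parameters, which makes the process more cost-efficient and memory-efficient.

The process repeats until the student's performance plateaus or the compute budget is reached, which keeps training cost-effective.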
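The five steps and the stopping rule can be sketched as a small driver loop. This is a minimal illustration and not the authors' implementation: the function `run_flywheel`, its callable parameters, and the `max_rounds` and `patience` settings are hypothetical names chosen here, with `max_rounds` standing in for the compute budget and `patience` for the plateau check.

```python
from typing import Callable, Dict, List, Optional

Example = Dict[str, dict]


def run_flywheel(
    generate_exam: Callable[[Optional[int], str], List[Example]],
    take_exam: Callable[[List[Example]], List[Example]],
    evaluate: Callable[[List[Example]], dict],
    fine_tune: Callable[[List[Example]], None],
    max_rounds: int = 5,
    patience: int = 2,
) -> List[int]:
    """Repeat exam -> evaluation -> curriculum -> fine-tuning until a plateau.

    All four callables are placeholders (assumptions) for the teacher and
    student calls described in the text. Returns the proficiency history.
    """
    curriculum: List[Example] = []
    history: List[int] = []
    proficiency: Optional[int] = None
    feedback = ""
    best, stale = -1, 0

    for _ in range(max_rounds):  # max_rounds models the compute budget
        exam = generate_exam(proficiency, feedback)  # step 1
        results = take_exam(exam)                    # step 2
        report = evaluate(results)                   # step 3
        proficiency = report["proficiency"]
        feedback = report["feedback"]
        history.append(proficiency)

        # Stop once the score has not improved for `patience` rounds.
        if proficiency > best:
            best, stale = proficiency, 0
        else:
            stale += 1
        if stale >= patience:
            break

        curriculum.extend(report["dataset"])         # step 4
        fine_tune(curriculum)                        # step 5

    return history
```

Passing the teacher and student as callables keeps the loop independent of any particular model API, so the same driver can be exercised with stubs before real models are attached.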

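Implementing fine-tuning

The following sections take a closer look at the five iterative steps of the automated fine-tuning workflow shown in Figure 1.

1. Generating the exam

The teacher LLM generates the exam using the EXAM_PROMPT shown below. The prompt inputs include:

Task data: Task-specific data, including user feedback or synthetic examples.
Task prompt: A prompt that turns the task data into a single training example for the student LLM.
Current proficiency: The student LLM's proficiency from the previous evaluation.
Feedback: Insights generated by the teacher that highlight the student's weaknesses.

To generate exam questions, the teacher LLM applies TASK_PROMPT to each entry in DATA_SOURCE, tailoring the questions to the student and to the areas that need improvement.

An example EXAM_PROMPT is shown below.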

EXAM_PROMPT = """
[TASK]
%s

[DATA_SOURCE]
%s

[PREVIOUS_EXAM_RESULTS]
Proficiency: %s
Feedback: %s

Create an exam of %s questions based on the task after the [TASK] tag.
You can use the data after [DATA_SOURCE] for creating the dataset.
Modify the data in the data source appropriately to create questions for the exam (for example to balance the exam).
If there is none, then create your own.
Results for the expected proficiency and feedback from the previous exam (if any)
are indicated after [PREVIOUS_EXAM_RESULTS]. You can use that information to create
a better exam (in terms of difficulty, questions etc).
The complete exam must appear as json after any thoughts you have
and there must be no other text after it. Think about your answer carefully.

Your output format must strictly be in the following format as a json.
"exam": A list of json objects in the format below
"question": A json with the information in the [INPUT_JSON] of the [TASK]
"answer": A json with the response based on the [OUTPUT_JSON] of the [TASK]
"""

TASK_PROMPT generates the task-specific exam questions. The following example predicts the severity of a code change during code review.

TASK_PROMPT = """Assign an issue type to the code below.

[ISSUE_TYPES]
critical: Security vulnerabilities, bugs that will cause a crash or code that can abruptly exit the execution.
major: Severe bugs that will cause a system to produce incorrect results.
minor: Results in some unexpected or undesired behavior, but not enough to disrupt system function.
trivial: Issue won't result in any noticeable breakdown of the system. (e.g. docstring changes, comments etc)

The code and review are formatted as json below.
[INPUT_JSON]
%s

Your output must be in json with no other text. The format is below.
{"issue_type": A value in [critical, major, minor, trivial]}
"""

INPUT_JSON = """
"code": The code snippet under review
"review": A review of the code.
"""

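The exam generation step produces a list of question-answer pairs in JSON format. The following is an example pair for the code review severity prediction task. The question contains the code change and the review feedback, and the answer specifies the expected severity.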

{
  "question": {
    "code": "<code snippet>",
    "review": "<code review>"
  },
  "answer": {
    "issue_type": "major"
  }
}
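Because EXAM_PROMPT lets the teacher write its thoughts before the JSON, the exam has to be pulled out of the end of the response. The following is a minimal sketch of one way to do that, assuming the teacher returns a top-level JSON object with an "exam" key as the prompt requests. The helper name `extract_exam` is hypothetical.

```python
import json
from typing import List


def extract_exam(response: str) -> List[dict]:
    """Return the "exam" list from the JSON object that ends the response."""
    decoder = json.JSONDecoder()
    # Try each "{" as a candidate start and accept the first object that
    # parses cleanly and runs to the end of the response.
    for start, char in enumerate(response):
        if char != "{":
            continue
        try:
            obj, end = decoder.raw_decode(response, start)
        except json.JSONDecodeError:
            continue
        if response[end:].strip():
            continue  # text follows this object, so it is not the final JSON
        if not isinstance(obj, dict) or not isinstance(obj.get("exam"), list):
            continue
        exam = obj["exam"]
        # Every item must carry both a question and an answer.
        if all(isinstance(q, dict) and {"question", "answer"} <= q.keys() for q in exam):
            return exam
    raise ValueError("no trailing JSON exam found in teacher response")
```

Rejecting malformed output here, instead of passing it on, means a bad teacher response fails loudly before the student is ever tested on it.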

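2. Taking the exam

In this step, the questions generated in step 1 are used to evaluate the student LLM. Each exam question is combined with TASK_PROMPT to create a list of prompts. The student LLM processes these prompts and generates answers based on its understanding.

For example, in code review severity prediction, the student LLM analyzes the code snippet and the review feedback and classifies the severity as critical, major, minor, or trivial.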
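As a rough sketch of this step, the function below fills a task prompt with each question, calls the student, and records the result in the "question" / "answer" / "model_answer" layout that the evaluation prompt expects. The `student` callable is a placeholder for whatever inference API is in use, the TASK_PROMPT here is a shortened stand-in for the full prompt, and storing `None` for unparsable answers is an assumption made for this example.

```python
import json
from typing import Callable, List

# Shortened stand-in for the article's TASK_PROMPT; like the original,
# it has a single %s slot for the question JSON.
TASK_PROMPT = (
    "Assign an issue type to the code below.\n"
    "[INPUT_JSON]\n%s\n"
    'Your output must be in json: {"issue_type": ...}\n'
)
VALID_TYPES = {"critical", "major", "minor", "trivial"}


def take_exam(exam: List[dict], student: Callable[[str], str]) -> List[dict]:
    """Ask the student every exam question and collect its answers."""
    results = []
    for item in exam:
        prompt = TASK_PROMPT % (json.dumps(item["question"]),)
        raw = student(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            parsed = None
        # Keep only well-formed answers; anything else counts as unanswered.
        valid = isinstance(parsed, dict) and parsed.get("issue_type") in VALID_TYPES
        results.append({
            "question": item["question"],
            "answer": item["answer"],
            "model_answer": parsed if valid else None,
        })
    return results
```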

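3. Evaluation

After the student LLM takes the exam, the teacher LLM evaluates its performance using EXAM_EVALUATION_PROMPT, which includes:

Task prompt: The prompt used to generate the exam questions.
Exam results: The student's answers from step 2, formatted as a list of question-answer pairs.
Data source: An optional data source that the teacher uses to generate additional training examples. If none is provided, the teacher LLM generates its own examples.
Number of training examples: The number of new training examples to create. These examples are added to the existing training data to address the student's weaknesses.

The teacher LLM assigns a proficiency score (1-10), provides feedback, and generates a tailored training dataset that targets the student's weaknesses.

An example EXAM_EVALUATION_PROMPT is shown below.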
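Since the teacher's evaluation drives the next round, it helps to check its structure before using it. The sketch below assumes the evaluation has already been parsed into a Python dictionary with the "feedback", "proficiency", and "dataset" fields. The helper name `validate_evaluation` and the choice to drop malformed training examples instead of failing are assumptions made for illustration.

```python
from typing import Any, Dict


def validate_evaluation(report: Dict[str, Any]) -> Dict[str, Any]:
    """Check the teacher's evaluation and return a cleaned copy."""
    feedback = report.get("feedback")
    if not isinstance(feedback, str):
        raise ValueError("feedback must be a string")

    proficiency = report.get("proficiency")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(proficiency, bool) or not isinstance(proficiency, int):
        raise ValueError("proficiency must be an integer")
    if not 1 <= proficiency <= 10:
        raise ValueError("proficiency must be between 1 and 10")

    dataset = report.get("dataset")
    if not isinstance(dataset, list):
        raise ValueError("dataset must be a list")
    # Keep only training examples that have both a question and an answer.
    clean = [
        ex for ex in dataset
        if isinstance(ex, dict) and "question" in ex and "answer" in ex
    ]
    return {"feedback": feedback, "proficiency": proficiency, "dataset": clean}
```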

EXAM_EVALUATION_PROMPT = """For the task specified by [TASK], an exam was
administered for evaluating the model's current capabilities. The results
appear after [EXAM_RESULTS] and are a json list where each element is a json
consisting of fields:
"question": The data for the task
"answer": The correct answer
"model_answer": The model's answer. If there is no answer here, then assume that
the model could not answer the question correctly.

[TASK]
%s

[EXAM_RESULTS]
%s

Your task is to evaluate these exam results to determine the model's capabilities
on a scale of 1-10 with 10 being proficient. Next, based on the model's results
on the exam, identify areas for improvement.
Finally, please develop a curriculum of %s examples for helping to train the model.
The output is VALID JSON formatted as follows:
"feedback": The feedback as a string
"proficiency": The proficiency score of the model as an integer.
"dataset": A list of json objects for training the model.
   "question": A question for training the model in the exact same format as the [TASK] asks
   "answer": The answer to the question in the same format as the [TASK] asks

You must use the data after [DATA_SOURCE] for creating the dataset. You can modify
the data as you see fit. If there is none, then create your own.

[DATA_SOURCE]
%s
"""

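4. Curriculum generation

The new training examples generated in step 3 are combined with the existing dataset to create an updated curriculum. This tailored curriculum specifically targets the student's weaknesses, which makes fine-tuning more effective.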
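A minimal sketch of this merge step follows. Two things in it are assumptions and not taken from the source: duplicate questions are dropped so repeated rounds do not over-weight the same example, and each example is written as a JSONL record with "input" and "output" fields, a layout commonly used for supervised fine-tuning data. Check the field names against the data format your fine-tuning script expects.

```python
import json
from typing import Iterable, List


def merge_curriculum(existing: Iterable[dict], new: Iterable[dict]) -> List[dict]:
    """Combine old and new examples, keeping the first copy of each question."""
    seen = set()
    merged = []
    for example in list(existing) + list(new):
        # Serialize with sorted keys so key order does not affect equality.
        key = json.dumps(example["question"], sort_keys=True)
        if key in seen:
            continue
        seen.add(key)
        merged.append(example)
    return merged


def to_jsonl(examples: Iterable[dict], task_prompt: str) -> str:
    """Render examples as JSONL; `task_prompt` must contain one %s slot."""
    lines = []
    for example in examples:
        record = {
            "input": task_prompt % (json.dumps(example["question"]),),
            "output": json.dumps(example["answer"]),
        }
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"
```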

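5. Fine-tuning

The student LLM is fine-tuned on the updated curriculum. Fine-tuning uses the NVIDIA NeMo framework, specifically the megatron_gpt_finetuning.py script available in the NeMo framework Docker container.

To simplify the process, the fine-tuning workflow is wrapped in a PEFTFineTuning class that can be integrated and run from Python. The following example shows how to launch fine-tuning.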

import os
import shutil
import subprocess

# The original code takes its default adapter store from an internal
# `ncodepro.NIM_STORE` constant that is not shown in this post. The value
# below is a placeholder; replace it with your own adapter store directory.
NIM_STORE = "/path/to/nim_store"


def initialize_directory(directory, clean=True):
    """Create `directory`, removing any existing copy first when `clean` is True."""
    if os.path.exists(directory) and clean:
        shutil.rmtree(directory)
    os.makedirs(directory, exist_ok=True)


class PEFTFineTuning:
    """Builds and runs a NeMo PEFT fine-tuning job through torchrun.

    `model` must expose `model_dir`, `model_path`, and `precision`;
    `dataset` must expose `name`, `train_ds`, and `val_ds`.
    """

    MEGATRON_GPT_FINETUNING_SCRIPT = \
        "/opt/Nemo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py"

    def __init__(self, scheme, dataset,
                 model,
                 adapter_name=None,
                 output_dir=None,
                 torchrun_nproc_per_node=1,
                 devices=1, num_nodes=1,
                 megatron_amp_O2=True, mcore_gpt=True,
                 tensor_size=1,
                 pipeline_size=1,
                 micro_batch_size=1,
                 global_batch_size=16,
                 ds_num_workers=0,
                 train_sampling_probs=None,
                 adapter_restore_path=None,
                 lr=1e-4,
                 adapter_dim=32):

        self.nproc_per_node = torchrun_nproc_per_node

        # Avoid a mutable default argument; keep the original default of [1.0].
        if train_sampling_probs is None:
            train_sampling_probs = [1.0]

        # Configuration overrides passed to the fine-tuning script as key=value.
        self.megatron_gpt_params = {
            "trainer.devices": devices,
            "trainer.num_nodes": num_nodes,
            "model.megatron_amp_O2": megatron_amp_O2,
            "++model.mcore_gpt": mcore_gpt,
            "model.tensor_model_parallel_size": tensor_size,
            "model.pipeline_model_parallel_size": pipeline_size,
            "model.micro_batch_size": micro_batch_size,
            "model.global_batch_size": global_batch_size,
            "model.data.train_ds.num_workers": ds_num_workers,
            "model.data.train_ds.concat_sampling_probabilities": train_sampling_probs,
            "model.data.validation_ds.num_workers": ds_num_workers,
            "model.peft.peft_scheme": scheme,
            "model.optim.lr": lr,
            "model.peft.lora_tuning.adapter_dim": adapter_dim,
        }

        # Resume from a previously trained adapter when one is given.
        if adapter_restore_path is not None:
            self.megatron_gpt_params["model.peft.restore_from_path"] = \
                adapter_restore_path

        self.model = model
        self.dataset = dataset
        self._adapter_name = adapter_name
        if self._adapter_name is None:
            self._adapter_name = "%s_%s" % (scheme, dataset.name)

        # Default to a per-adapter directory under the model directory.
        self.output_dir = output_dir
        if self.output_dir is None:
            self.output_dir = "%s/%s" % (self.model.model_dir,
                                         self._adapter_name)

    @property
    def adapter_name(self):
        return self._adapter_name

    def _get_peft_cmd(self):
        """Assemble the torchrun command with all configuration overrides."""
        cmd = ["torchrun"]
        cmd.append("--nproc_per_node=%s" % (self.nproc_per_node))
        cmd.append(PEFTFineTuning.MEGATRON_GPT_FINETUNING_SCRIPT)

        for key, value in self.megatron_gpt_params.items():
            cmd.append("%s=%s" % (key, value))

        return cmd

    def finetune(self, clean=True,
                 val_check_interval=20, max_steps=8000):
        """Run fine-tuning and write logs and checkpoints to `output_dir`."""
        initialize_directory(self.output_dir, clean)

        cmd = self._get_peft_cmd()

        cmd += [
            "exp_manager.exp_dir=%s" % (self.output_dir),
            "exp_manager.explicit_log_dir=%s" % (self.output_dir),
            "trainer.precision=%s" % (self.model.precision),
            "trainer.val_check_interval=%s" % (val_check_interval),
            "trainer.max_steps=%s" % (max_steps),
            "model.restore_from_path=%s" % (self.model.model_path),
            "model.data.train_ds.file_names=%s" % (self.dataset.train_ds),
            "model.data.validation_ds.file_names=%s" % (self.dataset.val_ds),
        ]

        # Raise CalledProcessError if the training process exits with an error.
        subprocess.run(cmd, check=True)

    def get_nim_adapter_path(self, base_dir=NIM_STORE):
        """Return the path where `save` places the adapter file."""
        nim_store_dir = "%s/%s" % (base_dir, self._adapter_name)
        nemo_model_path = "%s/%s.nemo" % (nim_store_dir, self._adapter_name)
        return nemo_model_path

    def save(self, base_dir=NIM_STORE, clean=True):
        """Copy the trained LoRA checkpoint into the adapter store."""
        nim_store_dir = "%s/%s" % (base_dir, self._adapter_name)
        nemo_model_path = "%s/%s.nemo" % (nim_store_dir, self._adapter_name)

        # The original called `file.initialize_directory`, which is undefined
        # here; the module-level helper above does the same job.
        initialize_directory(nim_store_dir, clean)

        peft_checkpoint = "%s/checkpoints/" \
            "megatron_gpt_peft_lora_tuning.nemo" % (self.output_dir)

        shutil.copyfile(peft_checkpoint, nemo_model_path)
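To make the efficiency claim for LoRA concrete, the following worked example counts trainable parameters for one weight matrix. It assumes that `adapter_dim` corresponds to the LoRA rank and uses a single 4096 x 4096 projection (4096 is the hidden size of Llama 3 8B) purely for illustration. Which layers actually receive adapters depends on the framework configuration and is not stated in this post.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two LoRA factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


hidden = 4096                      # hidden size of Llama 3 8B
full = hidden * hidden             # parameters updated by full fine-tuning
lora = lora_trainable_params(hidden, hidden, rank=32)  # rank = adapter_dim default

print(f"full fine-tuning: {full:,} parameters")
print(f"LoRA (rank 32):   {lora:,} parameters")
print(f"LoRA trains 1/{full // lora} of the matrix")
```

For this matrix, LoRA trains 262,144 parameters instead of 16,777,216, or 1/64 of the total, which is where the savings in memory and compute come from.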

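Real-world application to code review automation

Code review is essential for ensuring software quality and performance, and it has traditionally been performed by human reviewers. A typical code review process includes the following steps:

The author submits a merge request (MR) containing code that implements a feature or fixes an issue.
Human reviewers evaluate the MR and either suggest changes or approve the code.
If changes are requested, the author revises and resubmits the MR, and the process repeats until the code is accepted.

*Figure 2. The code review process.* The author submits an initial MR, which is reviewed by others. Feedback leads to updates, and the cycle repeats until the code is accepted.

Recent advances in generative AI have made it possible to automate the code review process, as shown in Figure 3. A fine-tuned LLM evaluates the MR to identify bugs or issues, assigns a severity to each issue, and explains the rating. The pipeline then filters out issues below a user-defined severity threshold so that developers can focus on critical problems such as security vulnerabilities.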
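The filtering step can be sketched in a few lines. This is an illustration only: the ordering of the four severity levels follows the definitions in TASK_PROMPT, while the function name `filter_issues` and the default threshold of "major" are assumptions made for the example.

```python
from typing import List

# Severity levels from least to most severe, as defined in TASK_PROMPT.
SEVERITY_ORDER = ["trivial", "minor", "major", "critical"]


def filter_issues(issues: List[dict], min_severity: str = "major") -> List[dict]:
    """Keep only issues at or above the user-defined severity threshold."""
    threshold = SEVERITY_ORDER.index(min_severity)
    return [
        issue for issue in issues
        if SEVERITY_ORDER.index(issue["issue_type"]) >= threshold
    ]


issues = [
    {"issue_type": "trivial", "review": "Typo in docstring"},
    {"issue_type": "critical", "review": "SQL built from unsanitized input"},
    {"issue_type": "minor", "review": "Unclear variable name"},
]
for issue in filter_issues(issues):
    print(issue["issue_type"], "-", issue["review"])
```

With the default threshold, only the critical issue is reported to the developer.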

Fine-tuned SLMs improve two key areas of NVIDIA's automated code review:

Severity rating: Improving the accuracy with which the LLM assigns severity levels.
Explanation generation: Improving the clarity and quality of the LLM's reasoning.

*Figure 3. Automated code review with LLMs.* Code is analyzed, issues are identified and rated for severity, and feedback is provided for the authors to address. The process repeats until the MR is accepted.

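Performance evaluation: accuracy and quality gains

We fine-tuned the Llama 3 8B Instruct model with the automated fine-tuning approach, producing llama3-8b+LORA. We evaluated its performance on two tasks:

Severity rating prediction: Measures the accuracy of the predicted severity ratings.
Severity explanation generation: Assesses the quality of the explanations given for the severity ratings.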

Severity rating prediction

Fine-tuning under the guidance of GPT-4, through knowledge distillation, significantly improved the smaller model's severity rating accuracy. As shown in Figure 4, the fine-tuned Llama 3 8B+LoRA (highlighted in green) achieved an improvement of more than 18% over the baseline (Llama 3 8B without fine-tuning).

Notably, the fine-tuned model (llama3-8b+LORA) also outperformed larger models such as Llama 3 70B (about 8x larger) and Nemotron 4 340B Instruct (about 40x larger). Along with its higher accuracy, the model keeps lower latency and lower inference cost, which shows that the fine-tuning approach is both efficient and effective for optimizing smaller models.

*Figure 4. Severity rating accuracy across LLMs.* Bar chart comparing severity rating accuracy across models. The LoRA fine-tuned Llama 3 8B Instruct model (llama3-8b+LORA), with GPT-4 as the teacher, outperformed its baseline by 18% and surpassed larger models, with reduced latency and resource usage.

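Severity explanation quality

To evaluate explanation quality, we used the teacher LLM (GPT-4) as a judge. The teacher compared explanations from the fine-tuned model with explanations generated by other models. Figure 5 shows the preference difference, which measures how often GPT-4 preferred the fine-tuned model's explanation over the other model's. A positive difference means the fine-tuned model did better, and a negative value means the opposite.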
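The post does not give the exact formula for the preference difference. One plausible reading, shown here as an assumption, is the share of comparisons the judge awards to the fine-tuned model minus the share it awards to the other model, with ties counting toward neither. The verdict labels "fine_tuned", "other", and "tie" are hypothetical.

```python
from collections import Counter
from typing import List


def preference_difference(verdicts: List[str]) -> float:
    """Fraction of wins for the fine-tuned model minus fraction of losses.

    Each verdict is "fine_tuned", "other", or "tie" (labels assumed here).
    The result lies in [-1, 1]; positive values favor the fine-tuned model.
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    counts = Counter(verdicts)
    return (counts["fine_tuned"] - counts["other"]) / len(verdicts)
```

For example, two wins, one tie, and one loss over four comparisons give a preference difference of (2 - 1) / 4 = 0.25.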

The results show that the LoRA fine-tuned Llama 3 8B model (llama3-8b+LORA) consistently outperformed Llama 3 70B, Nemotron 4 340B, and even its own baseline (Llama 3 8B). In every comparison, the fine-tuned model was either preferred or rated equally well, which indicates strong alignment with expert-level standards for explanation quality.

*Figure 5. Comparison of severity explanation quality.* The preference difference shows how often GPT-4 preferred the explanations of the LoRA fine-tuned model (llama3-8b+LORA). Green bars indicate a GPT-4 preference for the fine-tuned model, and gray bars indicate a preference for the other model. The fine-tuned model consistently matches or outperforms the other models.

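Benefits of fine-tuned SLMs: efficiency and performance gains

Applying fine-tuned small language models (SLMs) to code review automation demonstrated two main benefits:

Cost-effective fine-tuning: Using fine-tuned SLMs for code review tasks reduces cost and latency, which makes the approach well suited to enterprise workflows that must balance budget constraints against performance requirements.
Improved accuracy and consistency: Fine-tuned SLMs significantly improve task-specific performance. By raising severity rating accuracy and aligning explanations with expert-level standards, they provide reliable assessments that help development teams focus on critical code issues.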

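Lessons learned from scaling AI with SLMs

Developing fine-tuned SLMs with an automated approach gave us valuable insights into building cost-effective, scalable AI solutions tailored to enterprise applications. Key lessons include:

Start with targeted fine-tuning: Focus on smaller models to strike the best balance between performance and resource utilization. This lets enterprises evaluate trade-offs effectively before scaling up.
Use parameter-efficient fine-tuning (PEFT) and knowledge distillation: Combining PEFT methods such as LoRA with knowledge distillation delivers high performance with minimal computational overhead, which makes it a good fit for resource-constrained environments.

By using fine-tuned small LLMs, enterprises can address the high cost and high latency that large models typically bring. This strategy enables competitive accuracy while keeping AI solutions cost-effective, responsive, and tailored to specific needs. Although this post focused on code assistance, the approach is highly versatile and applies to a wide range of enterprise use cases.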

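Get started fine-tuning models for your AI applications

Learn how NVIDIA generative AI technologies can help you fine-tune and deploy models for your specific needs. If you are new to NVIDIA NIM, see Building Your First LLM Agent Application and Build Your First Human-in-the-Loop AI Agent with NVIDIA NIM for hands-on experience with NVIDIA tools and methods for developing and deploying NVIDIA NIM LLM microservices.

Acknowledgments

We sincerely thank Rushang Karia, Agustin Rivera, Mark Philipp, Abhinav Kumar, Anbang Xu, Rama Akkiraju, Ashwin Poojary, Ahmad Daoud, and Ashwin Jha for their valuable contributions and support. Their expertise and dedication were instrumental in bringing this work to fruition.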
