Synthetic data has become a standard part of large language model (LLM) post-training procedures. Using a large number of synthetically generated examples from a single open-source, commercially permissible LLM or a cohort of them, a base LLM is fine-tuned with supervised fine-tuning or RLHF to gain instruction-following and reasoning skills. This process can be seen as knowledge distillation from a cohort of LLM teachers to a target LLM student.
NVIDIA recently open-sourced the Llama-Nemotron post-training dataset, which contains 30M synthetic training examples supporting improvements in math, code, general reasoning, function calling, and instruction-following capabilities. As a proof point, NVIDIA trained and released three models using this dataset: Llama Nemotron Nano, Llama Nemotron Super, and Llama Nemotron Ultra.
Each model delivers leading accuracy across reasoning and agentic tasks within their respective weight classes.
This dataset release represents a significant step forward in openness and transparency in model development and improvement. By releasing the complete training set, in addition to the training technique, tools, and final model weights, NVIDIA supports both the re-creation and the improvement of this approach. The datasets are hosted on Hugging Face.
Data blend
The Llama-Nemotron dataset is composed of roughly 30M samples, distributed in the following broad categories:
| Category | Number of samples |
| --- | --- |
| Math | 19,840,970 (~1M unique prompts) |
| Code | 9,612,677 |
| Science | 708,920 |
| Instruction following | 56,339 |
| Chat | 39,792 |
| Safety | 31,426 |
These samples were generated by a collection of open-source, commercially permissible models, with the following distribution.
| Model | Number of samples |
| --- | --- |
| Llama-3.3-70B-Instruct | 420,021 |
| Llama-3.1-Nemotron-70B-Instruct | 31,218 |
| Llama-3.3-Nemotron-70B-Feedback/Edit/Select | 22,644 |
| Mixtral-8x22B-Instruct-v0.1 | 31,426 |
| DeepSeek-R1 | 1,212,994 |
| Qwen-2.5-Math-7B-Instruct | 19,840,970 |
| Qwen-2.5-Coder-32B-Instruct | 8,917,167 |
| Qwen-2.5-72B-Instruct | 464,658 |
| Qwen-2.5-32B-Instruct | 71,748 |
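If you want to explore the blend yourself, the following sketch streams a sample of the dataset from Hugging Face and tallies rows per category. The repository ID, split name, and column name are assumptions based on the tables above; check the dataset card for the actual identifiers.

```python
# Minimal sketch: tally the category blend from a streamed sample of the dataset.
# The repo ID, split name, and "category" column are assumptions; verify them
# against the Hugging Face dataset card before running.
from collections import Counter

from datasets import load_dataset

ds = load_dataset(
    "nvidia/Llama-Nemotron-Post-Training-Dataset",  # assumed repository ID
    split="train",                                  # assumed split name
    streaming=True,                                 # avoid downloading all ~30M rows
)

category_counts = Counter()
for i, row in enumerate(ds):
    category_counts[row.get("category", "unknown")] += 1
    if i >= 100_000:  # inspect a prefix rather than the full dataset
        break

for category, count in category_counts.most_common():
    print(f"{category:>25}: {count:,}")
```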
The prompts were either sourced from public and open corpora or synthetically generated. Prompts were extracted and then filtered for quality and complexity, or generated to meet quality and complexity requirements. Filtering included removing inconsistent prompts, prompts whose answers are easy to guess, and prompts with incorrect syntax.
The responses were synthetically generated by a variety of models, with some prompts paired with responses for both reasoning-on and reasoning-off modes, to train the model to distinguish between the two modes.
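As a rough illustration of the two modes (not the dataset's actual schema), a single prompt can appear twice: once with a reasoning trace and once with a direct answer, distinguished by a system-prompt toggle. The field names and toggle wording below are assumptions.

```python
# Hypothetical pair of training samples sharing one prompt: one "reasoning on"
# sample with an explicit thinking trace, one "reasoning off" sample without it.
# Field names and the system-prompt wording are illustrative assumptions only.
reasoning_on_sample = {
    "system": "detailed thinking on",   # assumed toggle phrasing
    "prompt": "What is 17 * 24?",
    "response": "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408</think>\n408",
}

reasoning_off_sample = {
    "system": "detailed thinking off",  # assumed toggle phrasing
    "prompt": "What is 17 * 24?",
    "response": "408",
}
```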
Chat data curation
For chat data, prompts came from both public, real-world user interactions (WildChat) and the synthetic data generation pipeline. Synthetic prompts covered various tasks, such as open QA, closed QA, and creative writing.
For each prompt task, we seeded the LLM generation with a diverse set of topics or keywords so that the prompts covered a wide variety of topics. For responses, we prompted LLMs for multiple generations and then did rejection sampling with the Llama-3.1-Nemotron-70B reward model. This ensured that the responses were of high quality.
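The sketch below shows the rejection-sampling idea in its simplest form: sample several candidate responses per prompt and keep the one the reward model scores highest. The `generate` and `score` callables are hypothetical placeholders for a response-generation endpoint and the reward model; this is not the NeMo Curator API.

```python
# Minimal rejection-sampling sketch: generate several candidates per prompt and
# keep the response with the highest reward-model score. `generate` and `score`
# are placeholders for your own serving stack, not NeMo Curator functions.
from typing import Callable, List


def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],       # prompt -> candidate response
    score: Callable[[str, str], float],   # (prompt, response) -> reward score
    num_candidates: int = 8,
) -> str:
    """Return the highest-reward response out of num_candidates generations."""
    candidates: List[str] = [generate(prompt) for _ in range(num_candidates)]
    scored = [(score(prompt, response), response) for response in candidates]
    _, best_response = max(scored, key=lambda pair: pair[0])
    return best_response
```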
To create the Llama-Nemotron 30M dataset, we used the Llama-3.3-70B-Instruct and DeepSeek-R1 models as the response generators (Figure 1).

To reproduce this chat data collection pipeline, follow the /NVIDIA/NeMo-Curator tutorial notebook.
Math data curation
To build the math-focused portion of the dataset, we developed a comprehensive pipeline for collecting and processing math problems from the Art of Problem Solving community forums.
Our approach involved several stages of LLM-based processing, using the Qwen2.5-32B-Instruct model unless otherwise noted:
- Problem extraction: We prompted an LLM to identify and extract all problems from the initial forum posts. While most posts contained a single problem, some included multiple problems or none at all.
- Problem classification: Each extracted problem was classified into proof or non-proof, as well as multiple-choice or non-multiple-choice categories.
- Question transformation: We converted proof questions into answer-based questions that required similar problem-solving techniques. All multiple-choice questions were converted into direct-answer questions by removing the choices and reformulating the question when necessary.
- Answer extraction: For non-proof questions, we attempted to extract the final answer from the forum discussions when available.
- Benchmark decontamination: We removed questions that closely resembled those in popular math benchmarks, using the lmsys decontaminator to ensure fair evaluation of models trained on this data (a simplified stand-in for this step is sketched after this list).
- Solution generation: For each question, we generated multiple solutions using a mix of open-weight LLMs (Qwen2.5-Math-7B-Instruct, QwQ-32B, and DeepSeek-R1).
- Solution validation: We selected only solutions that either reached the correct final answer (when one was extracted) or aligned with the majority vote (when the final answer was unknown), employing an LLM-as-a-judge to verify answer correctness (a simplified version of this step also appears below).
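The following sketch gives a simplified stand-in for the decontamination step, flagging training questions that share a large fraction of word n-grams with any benchmark question. It is not the lmsys decontaminator, which additionally catches paraphrases with an LLM, and the thresholds are illustrative.

```python
# Simplified decontamination stand-in: flag questions whose word n-grams overlap
# heavily with any benchmark question. This is not the lmsys decontaminator,
# which additionally detects paraphrased questions with an LLM.
from typing import List, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(
    question: str, benchmark_questions: List[str], threshold: float = 0.5
) -> bool:
    """Return True if the question overlaps heavily with any benchmark question."""
    question_ngrams = ngrams(question)
    if not question_ngrams:
        return False
    for benchmark_question in benchmark_questions:
        overlap = len(question_ngrams & ngrams(benchmark_question)) / len(question_ngrams)
        if overlap >= threshold:
            return True
    return False
```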
This pipeline is implemented using /NVIDIA/NeMo-Skills, a collection of pipelines for improving LLM skills. The toolkit initially focused on solving mathematical problems, but it now supports any LLM-based synthetic data generation task.
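A simplified version of the solution-validation step might look like the sketch below. The `extract_final_answer` helper and the exact-string comparison are illustrative placeholders; the actual pipeline uses more robust answer extraction and an LLM-as-a-judge to compare answers.

```python
# Simplified solution validation: keep solutions whose final answer matches the
# extracted reference answer, or the majority-vote answer when no reference is
# available. Answer extraction and comparison here are naive placeholders.
import re
from collections import Counter
from typing import List, Optional


def extract_final_answer(solution: str) -> Optional[str]:
    """Pull the last \\boxed{...} expression out of a solution, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else None


def select_valid_solutions(
    solutions: List[str], reference_answer: Optional[str] = None
) -> List[str]:
    answers = [extract_final_answer(s) for s in solutions]

    if reference_answer is None:
        # No ground truth available: treat the most common answer as correct.
        counts = Counter(a for a in answers if a is not None)
        if not counts:
            return []
        reference_answer, _ = counts.most_common(1)[0]

    return [s for s, a in zip(solutions, answers) if a == reference_answer]
```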
Coding data curation
To build the supervised fine-tuning dataset for code generation with reasoning, we used publicly available programming questions from the CodeContests dataset. Our approach involved the following main stages:
- Benchmark decontamination: We removed questions that closely resembled those in popular code benchmarks (HumanEval, MBPP, LiveCodeBench, and BigCodeBench). We used the lmsys decontaminator to ensure that none of the questions were paraphrased versions of the questions present in those benchmarks.
- Response generation: We prompted DeepSeek-R1 to generate multiple responses for the programming questions. With a maximum output sequence length of 16K tokens, we generated 32–40 responses per question.
- Reasoning traces and solution validation: We retained responses composed of complete reasoning traces followed by output sequences containing the solution code. We parsed the solution code to verify its syntactic correctness (a minimal version of this check is sketched below).
Like the math data, the code data is curated using a pipeline in /NVIDIA/NeMo-Skills.
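As a small illustration of the final filtering step, the sketch below extracts the solution code from a model response and checks that it parses. It assumes Python solutions in fenced code blocks; the actual NeMo-Skills pipeline handles additional languages and also validates the reasoning traces.

```python
# Illustrative syntax check: pull the last fenced Python code block out of a
# response and keep the response only if that code parses. Assumes Python
# solutions; the real pipeline covers more languages and checks.
import ast
import re
from typing import Optional


def extract_code_block(response: str) -> Optional[str]:
    """Return the last ```python ...``` block in the response, if present."""
    blocks = re.findall(r"```python\n(.*?)```", response, flags=re.DOTALL)
    return blocks[-1] if blocks else None


def is_syntactically_valid(code: str) -> bool:
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def keep_response(response: str) -> bool:
    """Keep only responses whose extracted solution code parses cleanly."""
    code = extract_code_block(response)
    return code is not None and is_syntactically_valid(code)
```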
Train your models now
The open release of the Llama-Nemotron datasets reaffirms NVIDIA’s commitment to open-source AI development. We hope that the open-source community will embrace and improve upon our approach. Download the open Llama Nemotron datasets from Hugging Face to build or fine-tune your reasoning models.
You can start replicating these pipelines, curate the datasets for your application with NeMo Curator and NeMo-Skills, and then fine-tune the model with the NeMo framework or NeMo Customizer microservice.