Synthetic data has become a standard part of large language model (LLM) post-training procedures. Using a large number of synthetically generated examples from a single open-source, commercially permissible LLM or a cohort of them, a base LLM is fine-tuned with supervised fine-tuning or RLHF to gain instruction-following and reasoning skills. This process can be seen as knowledge distillation from a cohort of LLM teachers to a target LLM student.
NVIDIA recently open-sourced the Llama-Nemotron post-training dataset, which contains 30M synthetic training examples supporting improvements in math, code, general reasoning, function calling, and instruction-following capabilities. As a proof point, NVIDIA trained and released three models using this dataset: Llama Nemotron Nano, Llama Nemotron Super, and Llama Nemotron Ultra.
Each model delivers leading accuracy across reasoning and agentic tasks within their respective weight classes.
This dataset release represents a significant step forward in openness and transparency in model development and improvement. By releasing the complete training set, in addition to the training technique, tools, and final model weights, NVIDIA supports both the re-creation and the improvement of this approach. The datasets are hosted on Hugging Face.
Data blend
The Llama-Nemotron dataset is composed of roughly 30M samples, distributed in the following broad categories:
| Category | Number of samples |
| --- | --- |
| Math | 19,840,970 (~1M unique prompts) |
| Code | 9,612,677 |
| Science | 708,920 |
| Instruction following | 56,339 |
| Chat | 39,792 |
| Safety | 31,426 |
These samples were generated by a collection of open-source, commercially permissible models, with the following distribution.
| Model | Number of samples |
| --- | --- |
| Llama-3.3-70B-Instruct | 420,021 |
| Llama-3.1-Nemotron-70B-Instruct | 31,218 |
| Llama-3.3-Nemotron-70B-Feedback/Edit/Select | 22,644 |
| Mixtral-8x22B-Instruct-v0.1 | 31,426 |
| DeepSeek-R1 | 1,212,994 |
| Qwen-2.5-Math-7B-Instruct | 19,840,970 |
| Qwen-2.5-Coder-32B-Instruct | 8,917,167 |
| Qwen-2.5-72B-Instruct | 464,658 |
| Qwen-2.5-32B-Instruct | 71,748 |
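If you want to explore the blend yourself, the following sketch streams a sample of the dataset from Hugging Face and tallies rows per category. The repository ID, split name, and column name are assumptions based on the tables above; check the dataset card for the actual identifiers.

```python
# Minimal sketch: tally the category blend from a streamed sample of the dataset.
# The repo ID, split name, and "category" column are assumptions; verify them
# against the Hugging Face dataset card before running.
from collections import Counter

from datasets import load_dataset

ds = load_dataset(
    "nvidia/Llama-Nemotron-Post-Training-Dataset",  # assumed repository ID
    split="train",                                  # assumed split name
    streaming=True,                                 # avoid downloading all ~30M rows
)

category_counts = Counter()
for i, row in enumerate(ds):
    category_counts[row.get("category", "unknown")] += 1
    if i >= 100_000:  # inspect a prefix rather than the full dataset
        break

for category, count in category_counts.most_common():
    print(f"{category:>25}: {count:,}")
```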
The prompts were either sourced from public and open corpora or synthetically generated. Prompts were extracted and then filtered for quality and complexity, or generated to meet quality and complexity requirements. Filtering included removing inconsistent prompts, prompts whose answers are easy to guess, and prompts with incorrect syntax.
The responses were synthetically generated by a variety of models, with some prompts paired with responses for both reasoning-on and reasoning-off modes, to train the model to distinguish between the two modes.
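As a rough illustration of the two modes (not the dataset's actual schema), a single prompt can appear twice: once with a reasoning trace and once with a direct answer, distinguished by a system-prompt toggle. The field names and toggle wording below are assumptions.

```python
# Hypothetical pair of training samples sharing one prompt: one "reasoning on"
# sample with an explicit thinking trace, one "reasoning off" sample without it.
# Field names and the system-prompt wording are illustrative assumptions only.
reasoning_on_sample = {
    "system": "detailed thinking on",   # assumed toggle phrasing
    "prompt": "What is 17 * 24?",
    "response": "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408</think>\n408",
}

reasoning_off_sample = {
    "system": "detailed thinking off",  # assumed toggle phrasing
    "prompt": "What is 17 * 24?",
    "response": "408",
}
```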
Chat data curation
For chat data, prompts came from both public, real-world user interactions (WildChat) and the synthetic data generation pipeline. Synthetic prompts covered various tasks, such as open QA, closed QA, and creative writing.
For each prompt task, we seeded the LLM generation with a diverse set of topics or keywords so that the prompts covered a wide variety of topics. For responses, we prompted LLMs for multiple generations and then did rejection sampling with the Llama-3.1-Nemotron-70B reward model. This ensured that the responses were of high quality.
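The sketch below shows the rejection-sampling idea in its simplest form: sample several candidate responses per prompt and keep the one the reward model scores highest. The `generate` and `score` callables are hypothetical placeholders for a response-generation endpoint and the reward model; this is not the NeMo Curator API.

```python
# Minimal rejection-sampling sketch: generate several candidates per prompt and
# keep the response with the highest reward-model score. `generate` and `score`
# are placeholders for your own serving stack, not NeMo Curator functions.
from typing import Callable, List


def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],       # prompt -> candidate response
    score: Callable[[str, str], float],   # (prompt, response) -> reward score
    num_candidates: int = 8,
) -> str:
    """Return the highest-reward response out of num_candidates generations."""
    candidates: List[str] = [generate(prompt) for _ in range(num_candidates)]
    scored = [(score(prompt, response), response) for response in candidates]
    _, best_response = max(scored, key=lambda pair: pair[0])
    return best_response
```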
To create the Llama-Nemotron 30M dataset, we used the Llama-3.3-70B-Instruct and DeepSeek-R1 models as the response generators (Figure 1).

To reproduce this chat data collection pipeline, follow the /NVIDIA/NeMo-Curator tutorial notebook.
Math data curation
To build the math-focused portion of the dataset, we developed a comprehensive pipeline for collecting and processing math problems from the Art of Problem Solving community forums.
Our approach involved several stages of LLM-based processing, using the Qwen2.5-32B-Instruct model unless otherwise noted:
- Problem extraction: We prompted an LLM to identify and extract all problems from the initial forum posts. While most posts contained a single problem, some included multiple problems or none at all.
- Problem classification: Each extracted problem was classified into proof or non-proof, as well as multiple-choice or non-multiple-choice categories.
- Question transformation: We converted proof questions into answer-based questions that required similar problem-solving techniques. All multiple-choice questions were converted into direct-answer questions by removing the choices and reformulating the question when necessary.
- Answer extraction: For non-proof questions, we attempted to extract the final answer from the forum discussions when available.
- Benchmark decontamination: We removed questions that closely resembled those in popular math benchmarks, using the lmsys decontaminator to ensure fair evaluation of models trained on this data (a simplified stand-in for this step is sketched after this list).
- Solution generation: For each question, we generated multiple solutions using a mix of open-weight LLMs (Qwen2.5-Math-7B-Instruct, QwQ-32B, and DeepSeek-R1).
- Solution validation: We selected only solutions that either reached the correct final answer (when one was extracted) or aligned with the majority vote (when the final answer was unknown), employing an LLM-as-a-judge to verify answer correctness (a simplified version of this step also appears below).
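The following sketch gives a simplified stand-in for the decontamination step, flagging training questions that share a large fraction of word n-grams with any benchmark question. It is not the lmsys decontaminator, which additionally catches paraphrases with an LLM, and the thresholds are illustrative.

```python
# Simplified decontamination stand-in: flag questions whose word n-grams overlap
# heavily with any benchmark question. This is not the lmsys decontaminator,
# which additionally detects paraphrased questions with an LLM.
from typing import List, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(
    question: str, benchmark_questions: List[str], threshold: float = 0.5
) -> bool:
    """Return True if the question overlaps heavily with any benchmark question."""
    question_ngrams = ngrams(question)
    if not question_ngrams:
        return False
    for benchmark_question in benchmark_questions:
        overlap = len(question_ngrams & ngrams(benchmark_question)) / len(question_ngrams)
        if overlap >= threshold:
            return True
    return False
```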
This pipeline is implemented using /NVIDIA/NeMo-Skills, a collection of pipelines for improving LLM skills. The toolkit initially focused on solving mathematical problems, but it now supports any LLM-based synthetic data generation task.
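A simplified version of the solution-validation step might look like the sketch below. The `extract_final_answer` helper and the exact-string comparison are illustrative placeholders; the actual pipeline uses more robust answer extraction and an LLM-as-a-judge to compare answers.

```python
# Simplified solution validation: keep solutions whose final answer matches the
# extracted reference answer, or the majority-vote answer when no reference is
# available. Answer extraction and comparison here are naive placeholders.
import re
from collections import Counter
from typing import List, Optional


def extract_final_answer(solution: str) -> Optional[str]:
    """Pull the last \\boxed{...} expression out of a solution, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else None


def select_valid_solutions(
    solutions: List[str], reference_answer: Optional[str] = None
) -> List[str]:
    answers = [extract_final_answer(s) for s in solutions]

    if reference_answer is None:
        # No ground truth available: treat the most common answer as correct.
        counts = Counter(a for a in answers if a is not None)
        if not counts:
            return []
        reference_answer, _ = counts.most_common(1)[0]

    return [s for s, a in zip(solutions, answers) if a == reference_answer]
```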
Coding data curation
To build the supervised fine-tuning dataset for code generation with reasoning, we used publicly available programming questions from the CodeContests dataset. Our approach involved the following main stages:
- Benchmark decontamination: We removed questions that closely resembled those in popular code benchmarks (HumanEval, MBPP, LiveCodeBench, and BigCodeBench). We used the lmsys decontaminator to ensure that none of the questions were paraphrased versions of the questions present in those benchmarks.
- Response generation: We prompted DeepSeek-R1 to generate multiple responses for the programming questions. With a maximum output sequence length of 16K tokens, we generated 32–40 responses per question.
- Reasoning traces and solution validation: We retained responses composed of complete reasoning traces followed by output sequences containing the solution code. We parsed the solution code to verify its syntactic correctness (a minimal version of this check is sketched below).
Like the math data, the code data is curated using a pipeline in /NVIDIA/NeMo-Skills.
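As a small illustration of the final filtering step, the sketch below extracts the solution code from a model response and checks that it parses. It assumes Python solutions in fenced code blocks; the actual NeMo-Skills pipeline handles additional languages and also validates the reasoning traces.

```python
# Illustrative syntax check: pull the last fenced Python code block out of a
# response and keep the response only if that code parses. Assumes Python
# solutions; the real pipeline covers more languages and checks.
import ast
import re
from typing import Optional


def extract_code_block(response: str) -> Optional[str]:
    """Return the last ```python ...``` block in the response, if present."""
    blocks = re.findall(r"```python\n(.*?)```", response, flags=re.DOTALL)
    return blocks[-1] if blocks else None


def is_syntactically_valid(code: str) -> bool:
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def keep_response(response: str) -> bool:
    """Keep only responses whose extracted solution code parses cleanly."""
    code = extract_code_block(response)
    return code is not None and is_syntactically_valid(code)
```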
Train your models now
The open release of the Llama-Nemotron datasets reaffirms NVIDIA’s commitment to open-source AI development. We hope that the open-source community will embrace and improve upon our approach. Download the open Llama Nemotron datasets from Hugging Face to build or fine-tune your reasoning models.
You can start replicating these pipelines, curate the datasets for your application with NeMo Curator and NeMo-Skills, and then fine-tune the model with the NeMo framework or NeMo Customizer microservice.