A typical recipe for improving LLMs involves multiple stages: synthetic data generation (SDG), model training through supervised fine-tuning (SFT) or reinforcement learning (RL), and model evaluation. Each stage requires using different libraries, which are often challenging to set up and difficult to use together.
For example, you might use NVIDIA TensorRT-LLM or vLLM for SDG and NVIDIA NeMo or verl for training. In this case, you’d need to invoke many different scripts and containers to convert a Hugging Face checkpoint to TensorRT-LLM, do large-scale SDG, convert the data and model into NeMo format, and run training followed by an evaluation on various benchmarks.
To streamline this complex workflow, NVIDIA developed the NeMo-Skills library. It offers high-level abstractions that seamlessly connect different frameworks, enabling their use in a unified and interchangeable manner. NeMo-Skills also makes it easy to transition from quick local prototyping to orchestrating large-scale jobs on a Slurm cluster.
This post walks you through a simplified version of the pipeline that helped the NVIDIA team win the AIMO2 Kaggle competition. The process starts with a model that has limited mathematical reasoning capabilities. Those skills are then enhanced through a series of NeMo-Skills jobs.
If you’re following along, you’ll need access to either an NVIDIA DGX box with eight NVIDIA A100 (or newer) GPUs or a Slurm cluster with similarly configured nodes. All commands used in the walkthrough were tested with NeMo-Skills.
Set up NeMo-Skills locally or on Slurm
To orchestrate complex jobs, NeMo-Skills uses Docker containers. You'll need to install the NVIDIA Container Toolkit if running locally, or use a Slurm cluster that supports the NVIDIA Pyxis plugin. In both cases, it's recommended that you set up NeMo-Skills on a local workstation and configure it to access your Slurm cluster through SSH. NeMo-Skills will take care of uploading your code and scheduling jobs.
Run the following commands locally to complete the setup:
```bash
pip install git+https://github.com/NVIDIA/NeMo-Skills.git
ns setup
```
When prompted to add mounts, define a folder as /workspace. This folder will be used in subsequent commands. For more details, see the NeMo-Skills configs documentation.
In the following sections, all commands use the --cluster=local argument. If you're running on Slurm, change it to --cluster=slurm (or whatever you named the config during the setup process). When using Slurm, all commands finish immediately and schedule jobs in the cluster queue.
Weights & Biases (W&B) will be used for convenient logging of evaluation results and model outputs. You can disable this by removing all W&B related arguments from the commands.
Establish a baseline
Before working on improving LLM skills, first evaluate the original model to see where it stands. Note that this tutorial works with Qwen2.5 14B Instruct and uses AIME24 and AIME25 to evaluate the model’s mathematical reasoning abilities. vLLM is used as the inference library.
```bash
# download the model
ns run_cmd --expname=download-14b --log_dir=/workspace/Qwen2.5-14B-Instruct --cluster=local \
    huggingface-cli download Qwen/Qwen2.5-14B-Instruct --local-dir /workspace/Qwen2.5-14B-Instruct

# prepare benchmark data
ns prepare_data aime24 aime25

# launch evaluation
ns eval \
    --cluster=local \
    --expname=baseline-eval \
    --run_after=download-14b \
    --model=/workspace/Qwen2.5-14B-Instruct \
    --server_type=vllm \
    --server_gpus=8 \
    --benchmarks=aime24:8,aime25:8 \
    --output_dir=/workspace/evals/baseline

# summarize results, after the evaluation job is done
ns summarize_results --cluster=local /workspace/evals/baseline --wandb_name=baseline-evals
```
The ns eval command runs eight generations for each sample in the aime24/aime25 benchmarks, and summarize_results reports the averaged pass@1, majority@8, and pass@8 metrics.
```
--------------------------------- aime24 --------------------------------
evaluation_mode | num_entries | avg_tokens | symbolic_correct | no_answer
pass@1[8]       | 30          | 829        | 11.67%           | 0.00%
majority@8      | 30          | 829        | 13.33%           | 0.00%
pass@8          | 30          | 829        | 33.33%           | 0.00%

--------------------------------- aime25 --------------------------------
evaluation_mode | num_entries | avg_tokens | symbolic_correct | no_answer
pass@1[8]       | 30          | 834        | 11.67%           | 0.42%
majority@8      | 30          | 834        | 20.00%           | 0.00%
pass@8          | 30          | 834        | 26.67%           | 0.00%
```
Note that you might not get exactly the same numbers because of the stochastic nature of LLM generations. You can read more about ns eval pipeline options in the NeMo-Skills evaluation documentation.
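To make these metrics concrete, here's a minimal sketch of how pass@1, pass@k, and majority@k can be computed from k sampled answers per problem. This is a simplified illustration, not the exact NeMo-Skills implementation:

```python
from collections import Counter

def score_problem(answers, correct_answer):
    """Score one problem given k sampled answers.

    Returns (pass@1, pass@k, majority@k): pass@1 is the fraction of the k
    samples that are correct (averaged over samples), pass@k is 1 if any
    sample is correct, and majority@k checks the most common answer.
    """
    is_correct = [a == correct_answer for a in answers]
    pass_at_1 = sum(is_correct) / len(answers)
    pass_at_k = float(any(is_correct))
    majority_answer, _ = Counter(answers).most_common(1)[0]
    majority_at_k = float(majority_answer == correct_answer)
    return pass_at_1, pass_at_k, majority_at_k

# Example: 8 samples for a hypothetical problem whose answer is "204"
answers = ["204", "33", "204", "204", "97", "204", "33", "204"]
print(score_problem(answers, "204"))  # (0.625, 1.0, 1.0)
```

In the actual pipeline, answers are graded by symbolic equivalence (the symbolic_correct column above) rather than exact string matching.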
Using SDG with TensorRT-LLM or vLLM
To improve on the established baseline, you can generate some synthetic mathematical data. Following the OpenMathReasoning recipe, use a small set of AoPS forum discussions and extract problems from them using Qwen2.5 14B Instruct. New “long reasoning” solutions will then be generated using QwQ 32B. These problem-solution pairs will be used for training.
This simplified pipeline is very basic and misses multiple important steps (extracting ground-truth answers and filtering for correctness, for example). However, it should be enough to teach the 14B model how to use long reasoning and significantly improve the baseline results.
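For reference, a correctness filter along the lines of the full recipe could look like the following sketch. The field names (predicted_answer, expected_answer) are assumptions for illustration, and the real recipe compares answers symbolically rather than as strings:

```python
import json

# Hypothetical correctness filter: keep only solutions whose final answer
# matches a known ground-truth answer. Field names are illustrative
# assumptions; a real pipeline would compare answers symbolically.
def filter_correct(input_path: str, output_path: str) -> None:
    kept = total = 0
    with open(input_path) as fin, open(output_path, "w") as fout:
        for line in fin:
            sample = json.loads(line)
            total += 1
            answer = sample.get("predicted_answer")    # assumed field name
            reference = sample.get("expected_answer")  # assumed field name
            if answer is not None and answer == reference:
                fout.write(line)
                kept += 1
    print(f"kept {kept}/{total} verified solutions")

filter_correct("/workspace/sdg/solutions/output.jsonl",
               "/workspace/sdg/solutions-filtered.jsonl")
```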
Start by downloading the data, as well as the problem extraction prompt and postprocessing script:
```bash
ns run_cmd --expname=prepare-data --log_dir=/workspace/prepare-data --cluster=local \
    'cd /workspace && \
    export DOWNLOAD_PREFIX=https://raw.githubusercontent.com/NVIDIA/NeMo-Skills/refs/heads/main/recipes/openmathreasoning && \
    wget $DOWNLOAD_PREFIX/scripts/prepare_raw_data.py && \
    wget $DOWNLOAD_PREFIX/prompts/extract-problems.yaml && \
    wget $DOWNLOAD_PREFIX/scripts/postprocess_problem_extraction.py && \
    python prepare_raw_data.py && \
    head -n 1000 raw_aops_data.jsonl > data.jsonl'
```
The fields from data.jsonl will be used to fill the prompt in extract-problems.yaml, and the final prompt will be passed to the LLM. To learn more, you can inspect the data file, the prompt config, and the postprocessing script. For more details about the prompt format, see the NeMo-Skills prompts documentation.
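Conceptually, this works like Python string formatting. Here's a minimal sketch with a made-up template and a hypothetical field name; the real template lives in extract-problems.yaml, and you can print the record's keys to see the actual fields:

```python
import json

# Simplified illustration of prompt filling. The real template and its
# placeholders are defined in extract-problems.yaml; this one is made up.
template = (
    "Extract all self-contained math problems from the following "
    "forum discussion:\n\n{discussion}"
)

with open("/workspace/data.jsonl") as f:
    record = json.loads(f.readline())

# "discussion" is a hypothetical field name; inspect record.keys()
# to see the actual fields in your data file.
prompt = template.format(discussion=record.get("discussion", ""))
print(prompt[:300])
```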
Next, run the generation pipeline using the NeMo-Skills Python API:
```python
# run_sdg.py
from nemo_skills.pipeline.cli import generate, wrap_arguments

cluster = "local"
num_gpus = 8

postprocess_cmd = (
    "python /workspace/postprocess_problem_extraction.py "
    "/workspace/sdg/problems/output.jsonl "
    "/workspace/sdg/extracted-problems.jsonl"
)

generate(
    ctx=wrap_arguments(
        "++prompt_config=/workspace/extract-problems.yaml "
        "++prompt_template=qwen-instruct "
    ),
    cluster=cluster,
    input_file="/workspace/data.jsonl",
    output_dir="/workspace/sdg/problems",
    postprocess_cmd=postprocess_cmd,
    expname="problem-extraction",
    run_after=["prepare-data", "download-14b"],
    model="/workspace/Qwen2.5-14B-Instruct",
    server_type="vllm",
    server_gpus=num_gpus,
    # remove these parameters to disable wandb logging
    log_samples=True,
    wandb_group="sdg",
)
```
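Once the extraction job finishes, you can sanity-check sdg/extracted-problems.jsonl. A minimal sketch that prints the keys and a truncated preview of the first record, without assuming any particular field names:

```python
import json

# Print the keys and a truncated preview of the first extracted record.
with open("/workspace/sdg/extracted-problems.jsonl") as f:
    record = json.loads(f.readline())

print(sorted(record.keys()))
print(str(record)[:500])
```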
Each record should now contain a new field with the extracted problem. Next, use the QwQ 32B model to generate solutions to these problems. Since this model produces long reasoning solutions that contain many tokens, convert the checkpoint to TensorRT-LLM format for the fastest inference:
```bash
# download the model
ns run_cmd --expname=download-qwq --log_dir=/workspace/QwQ-32B --cluster=local \
    huggingface-cli download Qwen/QwQ-32B --local-dir /workspace/QwQ-32B

# convert to trtllm format
ns convert \
    --cluster=local \
    --expname=convert-qwq-trtllm \
    --run_after=download-qwq \
    --input_model=/workspace/QwQ-32B \
    --output_model=/workspace/qwq32b-trtllm \
    --convert_from=hf \
    --convert_to=trtllm \
    --num_gpus=8 \
    --model_type=qwen \
    --hf_model_name=Qwen/QwQ-32B \
    --max_seq_len 10000
```
The next step is to generate solutions. Add the following code to the end of the run_sdg.py script and rerun it. By default, the completed problem-extraction step will be skipped, because NeMo-Skills detects when a generation has already finished.
```python
generate(
    ctx=wrap_arguments(
        "++prompt_config=generic/math "
        "++inference.temperature=0.6 "
        "++inference.tokens_to_generate=8192 "
        "++prompt_template=qwen-instruct "
    ),
    cluster=cluster,
    input_file="/workspace/sdg/extracted-problems.jsonl",
    output_dir="/workspace/sdg/solutions",
    expname="solution-generation",
    run_after=["problem-extraction", "convert-qwq-trtllm"],
    model="/workspace/qwq32b-trtllm",
    server_type="trtllm",
    server_gpus=num_gpus,
    # remove these parameters to disable wandb logging
    log_samples=True,
    wandb_group="sdg",
)
```
It might take a few hours for this job to complete. If you're able to run on multiple nodes in a Slurm cluster, you can parallelize it across N independent jobs by adding num_chunks=N. You can learn more about this and other parameters in the NeMo-Skills generation documentation.
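For example, the solution-generation call above could be split into four chunks. A sketch of the same call with the chunking parameter added (cluster switched to Slurm, where chunking pays off, and the W&B arguments omitted for brevity):

```python
generate(
    ctx=wrap_arguments(
        "++prompt_config=generic/math "
        "++inference.temperature=0.6 "
        "++inference.tokens_to_generate=8192 "
        "++prompt_template=qwen-instruct "
    ),
    cluster="slurm",
    input_file="/workspace/sdg/extracted-problems.jsonl",
    output_dir="/workspace/sdg/solutions",
    expname="solution-generation",
    run_after=["problem-extraction", "convert-qwq-trtllm"],
    model="/workspace/qwq32b-trtllm",
    server_type="trtllm",
    server_gpus=num_gpus,
    num_chunks=4,  # split the input into 4 independent jobs
)
```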
If you have W&B logging enabled, you can inspect generations there. Figure 1 shows the output of the ns generate pipeline in the W&B dashboard. You can find it by opening one experiment, switching to the Files tab, and clicking on samples.json.

ns generate
pipeline in the W&B dashboardModel training with NeMo
With the synthetic data generated, you can use it to fine-tune the model. The following sections will show how to use either NeMo-Aligner or NeMo-RL to do this.
First, prepare the data in the required format:
```bash
ns run_cmd --log_dir=/workspace/prepare-sft-data --expname=prepare-sft-data --run_after=solution-generation --cluster=local \
    'python -m nemo_skills.training.prepare_data \
    ++input_files=/workspace/sdg/solutions/output.jsonl \
    ++output_path=/workspace/sft-data.jsonl \
    ++prompt_config=generic/math \
    ++prompt_template=qwen-instruct \
    ++filters.remove_contaminated=false \
    ++add_unlabeled=true \
    ++filters.remove_no_think_tags=true \
    ++filters.trim_solutions=false'
```
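Because the training commands below cap sequences at 8,192 tokens (max_seq_length for NeMo-Aligner, max_total_sequence_length for NeMo-RL), it's worth checking how long the prepared samples are. A minimal sketch, assuming the transformers package is installed; it tokenizes each raw JSONL line, which slightly overestimates the true sample length:

```python
from transformers import AutoTokenizer

# Rough length check: samples longer than the 8192-token training cap
# may be truncated or dropped, so it helps to know how many there are.
tokenizer = AutoTokenizer.from_pretrained("/workspace/Qwen2.5-14B-Instruct")

lengths = []
with open("/workspace/sft-data.jsonl") as f:
    for line in f:
        # Each line is one JSON training record; tokenizing the raw line
        # gives a slight overestimate of the sample's token count.
        lengths.append(len(tokenizer(line)["input_ids"]))

lengths.sort()
print(f"samples: {len(lengths)}, median: {lengths[len(lengths) // 2]}, "
      f"max: {lengths[-1]}, over 8192: {sum(l > 8192 for l in lengths)}")
```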
Next, convert the model to NeMo format. You can skip this step for NeMo-RL training.
```bash
ns convert \
    --cluster=local \
    --expname=convert-14b-nemo \
    --run_after=download-14b \
    --input_model=/workspace/Qwen2.5-14B-Instruct \
    --output_model=/workspace/qwen2.5-14b-instruct-nemo \
    --convert_from=hf \
    --convert_to=nemo \
    --num_gpus=8 \
    --model_type=qwen \
    --hf_model_name=Qwen/Qwen2.5-14B-Instruct
```
For the NeMo-Aligner backend, use the following training command. Add --disable_wandb to disable W&B logging.
```bash
ns train \
    --cluster=local \
    --expname=training \
    --run_after=convert-14b-nemo \
    --run_after=prepare-sft-data \
    --output_dir=/workspace/training \
    --nemo_model=/workspace/qwen2.5-14b-instruct-nemo \
    --num_nodes=1 \
    --num_gpus=8 \
    --training_data=/workspace/sft-data.jsonl \
    ++model.data.train_ds.max_seq_length=8192 \
    ++model.data.train_ds.global_batch_size=32 \
    ++model.tensor_model_parallel_size=4 \
    ++model.context_parallel_size=2 \
    ++model.optim.lr=1e-5 \
    ++trainer.sft.max_epochs=2
```
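As a quick sanity check on these settings: tensor parallel 4 times context parallel 2 equals 8 GPUs, so the node holds exactly one model replica. A small sketch of the arithmetic (illustrative only):

```python
# Parallelism arithmetic for the NeMo-Aligner command above.
num_gpus = 8
tensor_parallel = 4     # ++model.tensor_model_parallel_size
context_parallel = 2    # ++model.context_parallel_size
global_batch_size = 32  # ++model.data.train_ds.global_batch_size

# GPUs per model replica is TP * CP; the remainder forms data parallelism.
data_parallel = num_gpus // (tensor_parallel * context_parallel)
print(f"data-parallel replicas: {data_parallel}")  # 1

# With a single replica, the 32-sample global batch is reached through
# gradient accumulation over micro-batches.
print(f"samples per optimizer step: {global_batch_size}")
```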
For the NeMo-RL backend, use the following training command. Add --disable_wandb to disable W&B logging. Only run one of the training commands, not both (or change the paths and experiment names accordingly).
```bash
ns nemo_rl sft \
    --cluster=local \
    --expname=training \
    --run_after=download-14b \
    --run_after=prepare-sft-data \
    --output_dir=/workspace/training \
    --hf_model=/workspace/Qwen2.5-14B-Instruct \
    --num_nodes=1 \
    --num_gpus=8 \
    --training_data=/workspace/sft-data.jsonl \
    --cache_dir=/workspace/nemo-rl-cache \
    --final_hf_path=/workspace/training/qwen2.5-14b-improved-hf \
    ++sft.max_num_epochs=4 \
    ++policy.dtensor_cfg.tensor_parallel_size=8 \
    ++policy.max_total_sequence_length=8192 \
    ++policy.train_global_batch_size=32 \
    ++policy.optimizer.kwargs.lr=1e-5 \
    ++policy.dtensor_cfg.sequence_parallel=true \
    ++policy.dtensor_cfg.activation_checkpointing=true
```
To learn more about SFT configuration, see the NeMo-Skills training documentation. If you have W&B logging enabled, you can inspect the training metrics there.

Final evaluation
To check model improvement, run another evaluation. Convert the checkpoint back into Hugging Face format for faster evaluation. If you trained with the NeMo-RL backend, you can skip the conversion step, since it already saved the final checkpoint in Hugging Face format (--final_hf_path).
```bash
# converting back to HF format
ns convert \
    --cluster=local \
    --expname=convert-14b-hf \
    --run_after=training \
    --input_model=/workspace/training/model-averaged-nemo \
    --output_model=/workspace/training/qwen2.5-14b-improved-hf \
    --convert_from=nemo \
    --convert_to=hf \
    --num_gpus=8 \
    --model_type=qwen \
    --hf_model_name=Qwen/Qwen2.5-14B-Instruct

# launching evaluation
ns eval \
    --cluster=local \
    --expname=final-eval \
    --run_after=convert-14b-hf \
    --run_after=training \
    --model=/workspace/training/qwen2.5-14b-improved-hf \
    --server_type=vllm \
    --server_gpus=8 \
    --benchmarks=aime24:8,aime25:8 \
    --output_dir=/workspace/evals/after-training \
    ++inference.tokens_to_generate=16384

# summarize results, after the evaluation job is done
ns summarize_results --cluster=local /workspace/evals/after-training --wandb_name=after-training-evals
```
This evaluation should show good improvements for both benchmarks. Figure 3 shows evaluation results in the W&B dashboard. Switch to the Runs panel and click on Columns to customize the displayed metrics.
```
--------------------------------- aime24 --------------------------------
evaluation_mode | num_entries | avg_tokens | symbolic_correct | no_answer
pass@1[8]       | 30          | 13362      | 27.92%           | 55.83%
majority@8      | 30          | 13362      | 40.00%           | 16.67%
pass@8          | 30          | 13362      | 50.00%           | 16.67%

--------------------------------- aime25 --------------------------------
evaluation_mode | num_entries | avg_tokens | symbolic_correct | no_answer
pass@1[8]       | 30          | 13445      | 17.92%           | 53.33%
majority@8      | 30          | 13445      | 26.67%           | 10.00%
pass@8          | 30          | 13445      | 36.67%           | 10.00%
```

Improve any LLM with NeMo-Skills
With NeMo-Skills, you can easily build sophisticated pipelines by connecting the various stages needed to improve LLM abilities. This enables you to seamlessly switch between different training and inference frameworks. All the commands used in this tutorial can be combined into a single script that schedules the entire job. With just one line change, you can transition from quick prototyping on your local workstation to large-scale experiments on a Slurm cluster.
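For instance, the final conversion and evaluation steps could be appended to the same Python script that ran SDG, so one file drives the whole pipeline. This sketch assumes the convert and eval entry points are importable from nemo_skills.pipeline.cli in the same way as generate, with keyword arguments mirroring the CLI flags shown earlier; check the NeMo-Skills documentation for the exact Python API:

```python
# Appended to run_sdg.py: convert and evaluate from the same script.
# Assumes convert and eval are importable like generate above, with
# keyword arguments mirroring the CLI flags (an assumption; see the docs).
from nemo_skills.pipeline.cli import convert, eval as ns_eval

convert(
    ctx=wrap_arguments(""),
    cluster=cluster,
    expname="convert-14b-hf",
    run_after=["training"],
    input_model="/workspace/training/model-averaged-nemo",
    output_model="/workspace/training/qwen2.5-14b-improved-hf",
    convert_from="nemo",
    convert_to="hf",
    num_gpus=num_gpus,
    model_type="qwen",
    hf_model_name="Qwen/Qwen2.5-14B-Instruct",
)

ns_eval(
    ctx=wrap_arguments("++inference.tokens_to_generate=16384"),
    cluster=cluster,
    expname="final-eval",
    run_after=["convert-14b-hf", "training"],
    model="/workspace/training/qwen2.5-14b-improved-hf",
    server_type="vllm",
    server_gpus=num_gpus,
    benchmarks="aime24:8,aime25:8",
    output_dir="/workspace/evals/after-training",
)
```

Because every step declares expname and run_after, NeMo-Skills resolves the dependency graph and schedules the stages in the right order, whether locally or on Slurm.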
As an exercise, try adding the extra filtering steps mentioned in the NeMo-Skills documentation. You can also try generating multiple solutions per problem and check how this affects final evaluation results. As you will see, having a single script that runs everything—from data generation to model training to evaluation—makes it very easy to iterate on changes to any part of the pipeline.
The NVIDIA team successfully used NeMo-Skills to develop several popular models and datasets. Specifically, it was used to:
- Power the winning NVIDIA submission to the AIMO-2 competition and develop the OpenMathReasoning dataset and OpenMath-Nemotron models.
- Create OpenCodeReasoning and OpenCodeInstruct collections.
- Curate a portion of post-training data for the Llama-Nemotron model series.
Get started with NeMo-Skills and apply it to your own projects.