A typical recipe for improving LLMs involves multiple stages: synthetic data generation (SDG), model training through supervised fine-tuning (SFT) or reinforcement learning (RL), and model evaluation. Each stage requires using different libraries, which are often challenging to set up and difficult to use together.
For example, you might use NVIDIA TensorRT-LLM or vLLM for SDG and NVIDIA NeMo or verl for training. In this case, you’d need to invoke many different scripts and containers to convert a Hugging Face checkpoint to TensorRT-LLM, do large-scale SDG, convert the data and model into NeMo format, and run training followed by an evaluation on various benchmarks.
To streamline this complex workflow, NVIDIA developed the NeMo-Skills library. It offers high-level abstractions that seamlessly connect different frameworks, enabling their use in a unified and interchangeable manner. NeMo-Skills also makes it easy to transition from quick local prototyping to orchestrating large-scale jobs on a Slurm cluster.
This post walks you through a simplified version of the pipeline that helped the NVIDIA team win the AIMO2 Kaggle competition. The process starts with a model that has limited mathematical reasoning capabilities. Those skills are then enhanced through a series of NeMo-Skills jobs.
If you’re following along, you’ll need access to either an NVIDIA DGX box with eight NVIDIA A100 (or newer) GPUs or a Slurm cluster with similarly configured nodes. All commands used in the walkthrough were tested with NeMo-Skills.
Set up NeMo-Skills locally or on Slurm
To orchestrate complex jobs, NeMo-Skills uses Docker containers. You'll need to install the NVIDIA Container Toolkit if running locally, or use a Slurm cluster that supports the NVIDIA Pyxis plugin. In both cases, it's recommended that you set up NeMo-Skills on a local workstation and configure it to access your Slurm cluster through SSH. NeMo-Skills will take care of uploading your code and scheduling jobs.
Run the following commands locally to complete the setup:
```bash
pip install git+https://github.com/NVIDIA/NeMo-Skills.git
ns setup
```
When prompted to add mounts, define a folder as /workspace. This folder will be used in subsequent commands. For more details, see the NeMo-Skills configs documentation.
In the following sections, all commands use the --cluster=local argument. If you're running on Slurm, change it to --cluster=slurm (or whatever you named the config during the setup process). When using Slurm, all commands finish immediately and schedule jobs in the cluster queue.
Weights & Biases (W&B) will be used for convenient logging of evaluation results and model outputs. You can disable this by removing all W&B related arguments from the commands.
Establish a baseline
Before working on improving LLM skills, first evaluate the original model to see where it stands. Note that this tutorial works with Qwen2.5 14B Instruct and uses AIME24 and AIME25 to evaluate the model’s mathematical reasoning abilities. vLLM is used as the inference library.
```bash
# download the model
ns run_cmd --expname=download-14b --log_dir=/workspace/Qwen2.5-14B-Instruct --cluster=local \
    huggingface-cli download Qwen/Qwen2.5-14B-Instruct --local-dir /workspace/Qwen2.5-14B-Instruct

# prepare benchmark data
ns prepare_data aime24 aime25

# launch evaluation
ns eval \
    --cluster=local \
    --expname=baseline-eval \
    --run_after=download-14b \
    --model=/workspace/Qwen2.5-14B-Instruct \
    --server_type=vllm \
    --server_gpus=8 \
    --benchmarks=aime24:8,aime25:8 \
    --output_dir=/workspace/evals/baseline

# summarize results, after the evaluation job is done
ns summarize_results --cluster=local /workspace/evals/baseline --wandb_name=baseline-evals
```
The ns eval command runs eight generations for each sample in the aime24/aime25 benchmarks, and summarize_results reports the averaged pass@1, majority@8, and pass@8 metrics.
```
--------------------------------- aime24 --------------------------------
evaluation_mode | num_entries | avg_tokens | symbolic_correct | no_answer
pass@1[8]       | 30          | 829        | 11.67%           | 0.00%
majority@8      | 30          | 829        | 13.33%           | 0.00%
pass@8          | 30          | 829        | 33.33%           | 0.00%

--------------------------------- aime25 --------------------------------
evaluation_mode | num_entries | avg_tokens | symbolic_correct | no_answer
pass@1[8]       | 30          | 834        | 11.67%           | 0.42%
majority@8      | 30          | 834        | 20.00%           | 0.00%
pass@8          | 30          | 834        | 26.67%           | 0.00%
```
Note that you might not get exactly the same numbers because of the stochastic nature of LLM generations. You can read more about ns eval pipeline options in the NeMo-Skills evaluation documentation.
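To make these metrics concrete, here's a minimal sketch of how pass@1, pass@k, and majority@k can be computed from k sampled answers per problem. This is a simplified illustration, not the exact NeMo-Skills implementation:

```python
from collections import Counter

def score_problem(answers, correct_answer):
    """Score one problem given k sampled answers.

    Returns (pass@1, pass@k, majority@k): pass@1 is the fraction of the k
    samples that are correct (averaged over samples), pass@k is 1 if any
    sample is correct, and majority@k checks the most common answer.
    """
    is_correct = [a == correct_answer for a in answers]
    pass_at_1 = sum(is_correct) / len(answers)
    pass_at_k = float(any(is_correct))
    majority_answer, _ = Counter(answers).most_common(1)[0]
    majority_at_k = float(majority_answer == correct_answer)
    return pass_at_1, pass_at_k, majority_at_k

# Example: 8 samples for a hypothetical problem whose answer is "204"
answers = ["204", "33", "204", "204", "97", "204", "33", "204"]
print(score_problem(answers, "204"))  # (0.625, 1.0, 1.0)
```

In the actual pipeline, answers are graded by symbolic equivalence (the symbolic_correct column above) rather than exact string matching.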
Using SDG with TensorRT-LLM or vLLM
To improve on the established baseline, you can generate some synthetic mathematical data. Following the OpenMathReasoning recipe, use a small set of AoPS forum discussions and extract problems from them using Qwen2.5 14B Instruct. New “long reasoning” solutions will then be generated using QwQ 32B. These problem-solution pairs will be used for training.
This simplified pipeline is very basic and misses multiple important steps (extracting ground-truth answers and filtering for correctness, for example). However, it should be enough to teach the 14B model how to use long reasoning and significantly improve the baseline results.
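For reference, a correctness filter along the lines of the full recipe could look like the following sketch. The field names (predicted_answer, expected_answer) are assumptions for illustration, and the real recipe compares answers symbolically rather than as strings:

```python
import json

# Hypothetical correctness filter: keep only solutions whose final answer
# matches a known ground-truth answer. Field names are illustrative
# assumptions; a real pipeline would compare answers symbolically.
def filter_correct(input_path: str, output_path: str) -> None:
    kept = total = 0
    with open(input_path) as fin, open(output_path, "w") as fout:
        for line in fin:
            sample = json.loads(line)
            total += 1
            answer = sample.get("predicted_answer")    # assumed field name
            reference = sample.get("expected_answer")  # assumed field name
            if answer is not None and answer == reference:
                fout.write(line)
                kept += 1
    print(f"kept {kept}/{total} verified solutions")

filter_correct("/workspace/sdg/solutions/output.jsonl",
               "/workspace/sdg/solutions-filtered.jsonl")
```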
Start by downloading the data, as well as the problem extraction prompt and postprocessing script:
```bash
ns run_cmd --expname=prepare-data --log_dir=/workspace/prepare-data --cluster=local \
    'cd /workspace && \
    export DOWNLOAD_PREFIX=https://raw.githubusercontent.com/NVIDIA/NeMo-Skills/refs/heads/main/recipes/openmathreasoning && \
    wget $DOWNLOAD_PREFIX/scripts/prepare_raw_data.py && \
    wget $DOWNLOAD_PREFIX/prompts/extract-problems.yaml && \
    wget $DOWNLOAD_PREFIX/scripts/postprocess_problem_extraction.py && \
    python prepare_raw_data.py && \
    head -n 1000 raw_aops_data.jsonl > data.jsonl'
```
The fields from data.jsonl will be used to fill the prompt in extract-problems.yaml, and the final prompt will be passed to the LLM. To learn more, you can inspect the data file, the prompt config, and the postprocessing script. For more details about the prompt format, see the NeMo-Skills prompts documentation.
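Conceptually, this works like Python string formatting. Here's a minimal sketch with a made-up template and a hypothetical field name; the real template lives in extract-problems.yaml, and you can print the record's keys to see the actual fields:

```python
import json

# Simplified illustration of prompt filling. The real template and its
# placeholders are defined in extract-problems.yaml; this one is made up.
template = (
    "Extract all self-contained math problems from the following "
    "forum discussion:\n\n{discussion}"
)

with open("/workspace/data.jsonl") as f:
    record = json.loads(f.readline())

# "discussion" is a hypothetical field name; inspect record.keys()
# to see the actual fields in your data file.
prompt = template.format(discussion=record.get("discussion", ""))
print(prompt[:300])
```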
Next, run the generation pipeline using the NeMo-Skills Python API:
```python
# run_sdg.py
from nemo_skills.pipeline.cli import generate, wrap_arguments

cluster = "local"
num_gpus = 8

postprocess_cmd = (
    "python /workspace/postprocess_problem_extraction.py "
    "/workspace/sdg/problems/output.jsonl "
    "/workspace/sdg/extracted-problems.jsonl"
)

generate(
    ctx=wrap_arguments(
        "++prompt_config=/workspace/extract-problems.yaml "
        "++prompt_template=qwen-instruct "
    ),
    cluster=cluster,
    input_file="/workspace/data.jsonl",
    output_dir="/workspace/sdg/problems",
    postprocess_cmd=postprocess_cmd,
    expname="problem-extraction",
    run_after=["prepare-data", "download-14b"],
    model="/workspace/Qwen2.5-14B-Instruct",
    server_type="vllm",
    server_gpus=num_gpus,
    # remove these parameters to disable wandb logging
    log_samples=True,
    wandb_group="sdg",
)
```
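Once the extraction job finishes, you can sanity-check sdg/extracted-problems.jsonl. A minimal sketch that prints the keys and a truncated preview of the first record, without assuming any particular field names:

```python
import json

# Print the keys and a truncated preview of the first extracted record.
with open("/workspace/sdg/extracted-problems.jsonl") as f:
    record = json.loads(f.readline())

print(sorted(record.keys()))
print(str(record)[:500])
```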
Each record should now contain a new field with the extracted problem. Next, use the QwQ 32B model to generate solutions to these problems. Since this model produces long reasoning solutions that contain many tokens, convert the checkpoint to TensorRT-LLM format for the fastest inference:
```bash
# download the model
ns run_cmd --expname=download-qwq --log_dir=/workspace/QwQ-32B --cluster=local \
    huggingface-cli download Qwen/QwQ-32B --local-dir /workspace/QwQ-32B

# convert to trtllm format
ns convert \
    --cluster=local \
    --expname=convert-qwq-trtllm \
    --run_after=download-qwq \
    --input_model=/workspace/QwQ-32B \
    --output_model=/workspace/qwq32b-trtllm \
    --convert_from=hf \
    --convert_to=trtllm \
    --num_gpus=8 \
    --model_type=qwen \
    --hf_model_name=Qwen/QwQ-32B \
    --max_seq_len 10000
```
The next step is to generate solutions. Add the following code to the end of the run_sdg.py script and rerun it. By default, the completed problem-extraction step will be skipped, because NeMo-Skills detects when a generation has already finished.
```python
generate(
    ctx=wrap_arguments(
        "++prompt_config=generic/math "
        "++inference.temperature=0.6 "
        "++inference.tokens_to_generate=8192 "
        "++prompt_template=qwen-instruct "
    ),
    cluster=cluster,
    input_file="/workspace/sdg/extracted-problems.jsonl",
    output_dir="/workspace/sdg/solutions",
    expname="solution-generation",
    run_after=["problem-extraction", "convert-qwq-trtllm"],
    model="/workspace/qwq32b-trtllm",
    server_type="trtllm",
    server_gpus=num_gpus,
    # remove these parameters to disable wandb logging
    log_samples=True,
    wandb_group="sdg",
)
```
It might take a few hours for this job to complete. If you're able to run on multiple nodes in a Slurm cluster, you can parallelize it across N independent jobs by adding num_chunks=N. You can learn more about this and other parameters in the NeMo-Skills generation documentation.
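For example, the solution-generation call above could be split into four chunks. A sketch of the same call with the chunking parameter added (cluster switched to Slurm, where chunking pays off, and the W&B arguments omitted for brevity):

```python
generate(
    ctx=wrap_arguments(
        "++prompt_config=generic/math "
        "++inference.temperature=0.6 "
        "++inference.tokens_to_generate=8192 "
        "++prompt_template=qwen-instruct "
    ),
    cluster="slurm",
    input_file="/workspace/sdg/extracted-problems.jsonl",
    output_dir="/workspace/sdg/solutions",
    expname="solution-generation",
    run_after=["problem-extraction", "convert-qwq-trtllm"],
    model="/workspace/qwq32b-trtllm",
    server_type="trtllm",
    server_gpus=num_gpus,
    num_chunks=4,  # split the input into 4 independent jobs
)
```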
If you have W&B logging enabled, you can inspect generations there. Figure 1 shows the output of the ns generate pipeline in the W&B dashboard. You can find it by opening one experiment, switching to the Files tab, and clicking on samples.json.

ns generate
pipeline in the W&B dashboardModel training with NeMo
With the synthetic data generated, you can use it to fine-tune the model. The following sections will show how to use either NeMo-Aligner or NeMo-RL to do this.
First, prepare the data in the required format:
```bash
ns run_cmd --log_dir=/workspace/prepare-sft-data --expname=prepare-sft-data --run_after=solution-generation --cluster=local \
    'python -m nemo_skills.training.prepare_data \
    ++input_files=/workspace/sdg/solutions/output.jsonl \
    ++output_path=/workspace/sft-data.jsonl \
    ++prompt_config=generic/math \
    ++prompt_template=qwen-instruct \
    ++filters.remove_contaminated=false \
    ++add_unlabeled=true \
    ++filters.remove_no_think_tags=true \
    ++filters.trim_solutions=false'
```
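Because the training commands below cap sequences at 8,192 tokens (max_seq_length for NeMo-Aligner, max_total_sequence_length for NeMo-RL), it's worth checking how long the prepared samples are. A minimal sketch, assuming the transformers package is installed; it tokenizes each raw JSONL line, which slightly overestimates the true sample length:

```python
from transformers import AutoTokenizer

# Rough length check: samples longer than the 8192-token training cap
# may be truncated or dropped, so it helps to know how many there are.
tokenizer = AutoTokenizer.from_pretrained("/workspace/Qwen2.5-14B-Instruct")

lengths = []
with open("/workspace/sft-data.jsonl") as f:
    for line in f:
        # Each line is one JSON training record; tokenizing the raw line
        # gives a slight overestimate of the sample's token count.
        lengths.append(len(tokenizer(line)["input_ids"]))

lengths.sort()
print(f"samples: {len(lengths)}, median: {lengths[len(lengths) // 2]}, "
      f"max: {lengths[-1]}, over 8192: {sum(l > 8192 for l in lengths)}")
```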
Next, convert the model to NeMo format. You can skip this step for NeMo-RL training.
```bash
ns convert \
    --cluster=local \
    --expname=convert-14b-nemo \
    --run_after=download-14b \
    --input_model=/workspace/Qwen2.5-14B-Instruct \
    --output_model=/workspace/qwen2.5-14b-instruct-nemo \
    --convert_from=hf \
    --convert_to=nemo \
    --num_gpus=8 \
    --model_type=qwen \
    --hf_model_name=Qwen/Qwen2.5-14B-Instruct
```
For the NeMo-Aligner backend, use the following training command. Add --disable_wandb to disable W&B logging.
```bash
ns train \
    --cluster=local \
    --expname=training \
    --run_after=convert-14b-nemo \
    --run_after=prepare-sft-data \
    --output_dir=/workspace/training \
    --nemo_model=/workspace/qwen2.5-14b-instruct-nemo \
    --num_nodes=1 \
    --num_gpus=8 \
    --training_data=/workspace/sft-data.jsonl \
    ++model.data.train_ds.max_seq_length=8192 \
    ++model.data.train_ds.global_batch_size=32 \
    ++model.tensor_model_parallel_size=4 \
    ++model.context_parallel_size=2 \
    ++model.optim.lr=1e-5 \
    ++trainer.sft.max_epochs=2
```
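As a quick sanity check on these settings: tensor parallel 4 times context parallel 2 equals 8 GPUs, so the node holds exactly one model replica. A small sketch of the arithmetic (illustrative only):

```python
# Parallelism arithmetic for the NeMo-Aligner command above.
num_gpus = 8
tensor_parallel = 4     # ++model.tensor_model_parallel_size
context_parallel = 2    # ++model.context_parallel_size
global_batch_size = 32  # ++model.data.train_ds.global_batch_size

# GPUs per model replica is TP * CP; the remainder forms data parallelism.
data_parallel = num_gpus // (tensor_parallel * context_parallel)
print(f"data-parallel replicas: {data_parallel}")  # 1

# With a single replica, the 32-sample global batch is reached through
# gradient accumulation over micro-batches.
print(f"samples per optimizer step: {global_batch_size}")
```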
For the NeMo-RL backend, use the following training command. Add --disable_wandb to disable W&B logging. Only run one of the training commands, not both (or change the paths and experiment names accordingly).
```bash
ns nemo_rl sft \
    --cluster=local \
    --expname=training \
    --run_after=download-14b \
    --run_after=prepare-sft-data \
    --output_dir=/workspace/training \
    --hf_model=/workspace/Qwen2.5-14B-Instruct \
    --num_nodes=1 \
    --num_gpus=8 \
    --training_data=/workspace/sft-data.jsonl \
    --cache_dir=/workspace/nemo-rl-cache \
    --final_hf_path=/workspace/training/qwen2.5-14b-improved-hf \
    ++sft.max_num_epochs=4 \
    ++policy.dtensor_cfg.tensor_parallel_size=8 \
    ++policy.max_total_sequence_length=8192 \
    ++policy.train_global_batch_size=32 \
    ++policy.optimizer.kwargs.lr=1e-5 \
    ++policy.dtensor_cfg.sequence_parallel=true \
    ++policy.dtensor_cfg.activation_checkpointing=true
```
To learn more about SFT configuration, see the NeMo-Skills training documentation. If you have W&B logging enabled, you can inspect the training metrics there.

Final evaluation
To check model improvement, run another evaluation. Convert the checkpoint back into Hugging Face format for faster evaluation. If you trained with the NeMo-RL backend, you can skip the conversion step, since it already saved the final checkpoint in Hugging Face format (--final_hf_path).
```bash
# converting back to HF format
ns convert \
    --cluster=local \
    --expname=convert-14b-hf \
    --run_after=training \
    --input_model=/workspace/training/model-averaged-nemo \
    --output_model=/workspace/training/qwen2.5-14b-improved-hf \
    --convert_from=nemo \
    --convert_to=hf \
    --num_gpus=8 \
    --model_type=qwen \
    --hf_model_name=Qwen/Qwen2.5-14B-Instruct

# launching evaluation
ns eval \
    --cluster=local \
    --expname=final-eval \
    --run_after=convert-14b-hf \
    --run_after=training \
    --model=/workspace/training/qwen2.5-14b-improved-hf \
    --server_type=vllm \
    --server_gpus=8 \
    --benchmarks=aime24:8,aime25:8 \
    --output_dir=/workspace/evals/after-training \
    ++inference.tokens_to_generate=16384

# summarize results, after the evaluation job is done
ns summarize_results --cluster=local /workspace/evals/after-training --wandb_name=after-training-evals
```
This evaluation should show good improvements for both benchmarks. Figure 3 shows evaluation results in the W&B dashboard. Switch to the Runs panel and click on Columns to customize the displayed metrics.
```
--------------------------------- aime24 --------------------------------
evaluation_mode | num_entries | avg_tokens | symbolic_correct | no_answer
pass@1[8]       | 30          | 13362      | 27.92%           | 55.83%
majority@8      | 30          | 13362      | 40.00%           | 16.67%
pass@8          | 30          | 13362      | 50.00%           | 16.67%

--------------------------------- aime25 --------------------------------
evaluation_mode | num_entries | avg_tokens | symbolic_correct | no_answer
pass@1[8]       | 30          | 13445      | 17.92%           | 53.33%
majority@8      | 30          | 13445      | 26.67%           | 10.00%
pass@8          | 30          | 13445      | 36.67%           | 10.00%
```

Improve any LLM with NeMo-Skills
With NeMo-Skills, you can easily build sophisticated pipelines by connecting the various stages needed to improve LLM abilities. This enables you to seamlessly switch between different training and inference frameworks. All the commands used in this tutorial can be combined into a single script that schedules the entire job. With just one line change, you can transition from quick prototyping on your local workstation to large-scale experiments on a Slurm cluster.
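For instance, the final conversion and evaluation steps could be appended to the same Python script that ran SDG, so one file drives the whole pipeline. This sketch assumes the convert and eval entry points are importable from nemo_skills.pipeline.cli in the same way as generate, with keyword arguments mirroring the CLI flags shown earlier; check the NeMo-Skills documentation for the exact Python API:

```python
# Appended to run_sdg.py: convert and evaluate from the same script.
# Assumes convert and eval are importable like generate above, with
# keyword arguments mirroring the CLI flags (an assumption; see the docs).
from nemo_skills.pipeline.cli import convert, eval as ns_eval

convert(
    ctx=wrap_arguments(""),
    cluster=cluster,
    expname="convert-14b-hf",
    run_after=["training"],
    input_model="/workspace/training/model-averaged-nemo",
    output_model="/workspace/training/qwen2.5-14b-improved-hf",
    convert_from="nemo",
    convert_to="hf",
    num_gpus=num_gpus,
    model_type="qwen",
    hf_model_name="Qwen/Qwen2.5-14B-Instruct",
)

ns_eval(
    ctx=wrap_arguments("++inference.tokens_to_generate=16384"),
    cluster=cluster,
    expname="final-eval",
    run_after=["convert-14b-hf", "training"],
    model="/workspace/training/qwen2.5-14b-improved-hf",
    server_type="vllm",
    server_gpus=num_gpus,
    benchmarks="aime24:8,aime25:8",
    output_dir="/workspace/evals/after-training",
)
```

Because every step declares expname and run_after, NeMo-Skills resolves the dependency graph and schedules the stages in the right order, whether locally or on Slurm.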
As an exercise, try adding the extra filtering steps mentioned in the NeMo-Skills documentation. You can also try generating multiple solutions per problem and check how this affects final evaluation results. As you will see, having a single script that runs everything—from data generation to model training to evaluation—makes it very easy to iterate on changes to any part of the pipeline.
The NVIDIA team successfully used NeMo-Skills to develop several popular models and datasets. Specifically, it was used to:
- Power the winning NVIDIA submission to the AIMO-2 competition and develop the OpenMathReasoning dataset and OpenMath-Nemotron models.
- Create OpenCodeReasoning and OpenCodeInstruct collections.
- Curate a portion of post-training data for the Llama-Nemotron model series.
Get started with NeMo-Skills and apply it to your own projects.