
    Run Hugging Face Models Instantly with Day-0 Support from NVIDIA NeMo Framework

    As organizations strive to maximize the value of their generative AI investments, accessing the latest model developments is crucial to continued success. By using state-of-the-art models on Day-0, teams can harness these innovations efficiently, maintain relevance, and be competitive.

The past year has seen a flurry of exciting model series releases in the open-source community, including Meta Llama, Google Gemma, Mistral Codestral, Codestral Mamba, Mistral Large 2, Mixtral, Qwen 2, 2.5, and 3, DeepSeek-R1, NVIDIA Nemotron, and NVIDIA Llama Nemotron. These models are often made available on the Hugging Face Hub, providing the broader community with easy access.

    Shortly after release, many users focus on evaluating model capabilities and exploring potential applications. Fine-tuning for specific use cases often becomes a key priority to gain an understanding of the models’ potential and to identify opportunities for innovation.

The NVIDIA NeMo framework uses the NVIDIA Megatron-Core and Transformer Engine (TE) backends to achieve high throughput and Model FLOPs Utilization (MFU) on thousands of NVIDIA GPUs, driving exceptional performance. However, integrating new model architectures into the NeMo framework requires multi-stage model conversion using Megatron-Core primitives, followed by validation of several phases, including supervised and parameter-efficient fine-tuning, model evaluation, and Hugging Face-to-NeMo conversion. This introduces a delay between model release and the availability of optimal training/post-training recipes.

    To ensure Day-0 support for the latest models, NeMo framework introduces the Automatic Model (AutoModel) feature.

    Introducing AutoModel in NVIDIA NeMo Framework

AutoModel is a high-level interface designed to simplify support for pretrained models. As part of the NeMo framework, it enables users to seamlessly fine-tune any Hugging Face model for quick experimentation. AutoModel currently covers the text generation and vision-language model categories, with plans to expand into more categories, such as video generation.

    A diagram showing the integration of Hugging Face models with NVIDIA NeMo framework using the AutoModel feature.
Figure 1. AutoModel provides the NeMo framework with seamless integration of Hugging Face models.

    The AutoModel feature provides out-of-the-box support for:

    • Model parallelism to enable scaling—currently through Fully-Sharded Data Parallelism 2 (FSDP2) and Distributed Data Parallel (DDP), with Tensor Parallelism (TP) and Context Parallelism (CP) coming soon.
    • Enhanced PyTorch performance with JIT compilation.
    • Seamless transition to the latest optimal training and post-training recipes powered by Megatron-Core, as they become available.
• Export to vLLM for optimized inference, with NVIDIA TensorRT-LLM export coming soon (a serving sketch follows this list).
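Because AutoModel keeps checkpoints in Hugging Face format, a fine-tuned model can also be served directly with vLLM's Python API. The following is a minimal sketch; the checkpoint path and prompt are placeholders, not taken from a NeMo example:

from vllm import LLM, SamplingParams

# Load a fine-tuned checkpoint saved in Hugging Face format (hypothetical path)
llm_engine = LLM(model="./checkpoints/my-finetuned-model")

# Generate a short completion to sanity-check the model
outputs = llm_engine.generate(
    ["Question: What does the AutoModel feature provide?\nAnswer:"],
    SamplingParams(max_tokens=64, temperature=0.0),
)
print(outputs[0].outputs[0].text)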

    By using the Hugging Face ecosystem, AutoModel enables effortless integration of a vast array of LLMs, without requiring explicit checkpoint rewrites. All models are natively supported, with a subset of the most popular also receiving optimized Megatron-Core support.
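As a quick illustration, loading a Hub checkpoint requires nothing more than its model ID; the ID below is only an example:

from nemo.collections import llm

# Wrap any Hugging Face causal-LM checkpoint by its Hub ID; no conversion step is needed
model = llm.HFAutoModelForCausalLM("meta-llama/Llama-3.2-1B")  # example model ID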

    An image showing the two training workflows with NVIDIA NeMo Framework—the existing Megatron-Core path, and the new AutoModel path.
Figure 2. NeMo framework training workflow with the new AutoModel path that offers Day-0 support
|  | Megatron-Core Backend | AutoModel Backend |
| --- | --- | --- |
| Coverage | Most popular LLMs, with recipes tuned by experts | All Hugging Face text models, supported on Day-0 |
| Training throughput performance | Optimal throughput with Megatron-Core kernels | Good performance with Liger kernels, Cut Cross-Entropy, and PyTorch JIT |
| Scalability | Up to 1,000 GPUs with full 4-D parallelism (TP, PP, CP, EP) | Comparable scalability using PyTorch-native TP, CP, and FSDP2 at slightly reduced training throughput |
| Inference path | Export to TensorRT-LLM, vLLM, or directly to NVIDIA NIM | Export to vLLM |

    Table 1. Comparison of the two backends in the NeMo framework: Megatron-Core and AutoModel 

    How to use AutoModel

To load a model and run LoRA or supervised fine-tuning (SFT) using AutoModel in the NeMo framework, follow these high-level steps:

    1. Instantiate an HF model: Use llm.HFAutoModelForCausalLM to load any HF model, specifying the model_id parameter to select the desired model.
2. Add adapters: Use llm.peft.LoRA to add LoRA adapters to the model.
  1. Specify LoRA target modules: Identify the modules to adapt using target_modules, with flexible matching via regex on Fully Qualified Names (FQNs).
  2. Set peft=None to tune all parameters with full SFT instead.
3. Prepare data: Use Hugging Face datasets with llm.HFDatasetDataModule.
4. Configure parallelism: Specify the model parallelism and sharding strategy (DDP or FSDP2) to scale across multiple nodes.

Refer to the pseudo-code below; the full reference example is available in the NeMo framework GitHub repository.

# Imports as used in the NeMo framework 2.0 examples
import fiddle as fdl
from datasets import load_dataset
from nemo import lightning as nl
from nemo.collections import llm

# model_id, args, and formatting_prompts_func are defined in the full example
# (a hypothetical formatting_prompts_func is sketched after this block)
dataset = load_dataset("rajpurkar/squad", split="train")
dataset = dataset.map(formatting_prompts_func)

llm.api.finetune(
    # Model & PEFT scheme
    model=llm.HFAutoModelForCausalLM(model_id),

    # Setting peft=None will run full-parameter SFT
    peft=llm.peft.LoRA(
        target_modules=['*_proj', 'linear_qkv'],  # regex-based selector on module FQNs
        dim=32,
    ),

    # Data
    data=llm.HFDatasetDataModule(dataset),

    # Optimizer
    optim=fdl.build(llm.adam.pytorch_adam_with_flat_lr(lr=1e-5)),

    # Trainer
    trainer=nl.Trainer(
        devices=args.devices,
        max_steps=args.max_steps,
        strategy=args.strategy,  # choices: [None, 'ddp', FSDP2Strategy]
        # ...additional trainer arguments in the full example
    ),
)
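The formatting_prompts_func, model_id, and args used above come from the full example. As a rough illustration, a hypothetical formatting function for the rajpurkar/squad split could look like the following; the prompt template is an assumption, not the one used in the official example:

def formatting_prompts_func(example):
    # Build a simple context/question prompt from the SQuAD fields (hypothetical template)
    prompt = (
        f"Context: {example['context']}\n"
        f"Question: {example['question']}\n"
        f"Answer:"
    )
    answer = example["answers"]["text"][0] if example["answers"]["text"] else ""
    return {"text": f"{prompt} {answer}"}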

Switching to the Megatron-Core supported path for maximum throughput is straightforward, requiring only minimal code changes thanks to a consistent API. The substitutions are listed below, followed by a short sketch that puts them together.

1. Model class
• Instead of AutoModel: model=llm.HFAutoModelForCausalLM(model_id)
• Use Megatron-Core: model=llm.LlamaModel(Llama32Config1B())
2. Optimizer module
• Instead of AutoModel: optim=fdl.build(llm.adam.pytorch_adam_with_flat_lr(lr=1e-5))
• Use Megatron-Core: optim=MegatronOptimizerModule(config=opt_config, ...)
3. Trainer strategy
• Instead of AutoModel: strategy=args.strategy  # choices: [None, 'ddp', 'fsdp2']
• Use Megatron-Core: strategy=nl.MegatronStrategy(ddp="pytorch", ...)
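Putting these substitutions together, a minimal sketch of the Megatron-Core variant might look like the following. The import paths, the optimizer configuration values, and the reuse of the same data module are assumptions based on the NeMo 2.0 API, not code from this post:

from nemo import lightning as nl
from nemo.collections import llm
from megatron.core.optimizer import OptimizerConfig

llm.api.finetune(
    # Megatron-Core model class and config instead of HFAutoModelForCausalLM
    model=llm.LlamaModel(llm.Llama32Config1B()),

    # Same Hugging Face data module as before (reusing it here is an assumption)
    data=llm.HFDatasetDataModule(dataset),

    # Megatron optimizer module instead of the plain PyTorch Adam recipe
    optim=nl.MegatronOptimizerModule(
        config=OptimizerConfig(optimizer="adam", lr=1e-5, use_distributed_optimizer=True),
    ),

    # MegatronStrategy replaces the DDP/FSDP2 strategy choices
    trainer=nl.Trainer(
        devices=8,
        max_steps=100,
        strategy=nl.MegatronStrategy(tensor_model_parallel_size=1, ddp="pytorch"),
    ),
)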

    This enables optimal performance in training and post-training with minimal overhead.

    Adding a new AutoModel class in NeMo

    Currently, NeMo AutoModel supports the AutoModelForCausalLM class for text generation. 

If you want to add support for other tasks (such as sequence-to-sequence language modeling with AutoModelForSeq2SeqLM), create a subclass similar to HFAutoModelForCausalLM, adapting the initializer, model configuration, training/validation steps, and save/load methods to your specific use case. Also implement appropriate checkpoint handling and create a new data module with custom batch preprocessing for your dataset.
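For instance, a skeleton for a sequence-to-sequence wrapper might look like the outline below. The base class and method names are illustrative only, modeled on the steps above rather than on NeMo's actual internal interfaces:

import lightning.pytorch as pl
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

class HFAutoModelForSeq2SeqLM(pl.LightningModule):
    def __init__(self, model_id: str):
        super().__init__()
        self.model_id = model_id
        self.model = None
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)

    def configure_model(self):
        # Defer instantiation so a parallel/sharding strategy can wrap the model
        if self.model is None:
            self.model = AutoModelForSeq2SeqLM.from_pretrained(self.model_id)

    def training_step(self, batch, batch_idx):
        # batch is expected to carry input_ids, attention_mask, and labels
        outputs = self.model(**batch)
        self.log("train_loss", outputs.loss)
        return outputs.loss

    def validation_step(self, batch, batch_idx):
        outputs = self.model(**batch)
        self.log("val_loss", outputs.loss)

    def save_pretrained(self, path: str):
        # Keep the checkpoint in Hugging Face format for easy export
        self.model.save_pretrained(path)
        self.tokenizer.save_pretrained(path)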

    You’ll find more comprehensive steps in the NeMo framework documentation. By following them and using existing classes as references, you can quickly extend NeMo AutoModel to support new tasks and models!

    Conclusion

The AutoModel feature in the NeMo framework enables rapid experimentation with a performant implementation, natively supporting Hugging Face models without requiring model conversions. Additionally, it provides a seamless “opt-in” to the high-performance Megatron-Core path, allowing users to easily switch to optimized training with minimal code changes.

AutoModel was introduced with the NeMo framework 25.02 release. To get started, refer to the AutoModel tutorial notebooks for PEFT LoRA, SFT, and multi-node scaling. We also invite the developer community to share feedback and contribute code to help shape the future development of AutoModel.
