
    Applying Specialized LLMs with Reasoning Capabilities to Accelerate Battery Research

    An illustration showing molecules and a brain.

    Scientific research in complex fields like battery innovation is often slowed by manual evaluation of materials, limiting progress to just dozens of candidates per day. In this blog post, we explore how domain-adapted large language models (LLMs), enhanced with reasoning capabilities, are transforming scientific research, especially in high-stakes, complex domains like battery innovation. We dive into SES AI’s Molecular Universe LLM, a 70B parameter scientific LLM, to showcase a real-world implementation of this approach. 

    You’ll learn about the training and inference pipeline built using NVIDIA NeMo Curator, NVIDIA NeMo Framework, NVIDIA DGX Cloud, and NVIDIA NIM, along with how combining techniques such as domain adaptation, instruction tuning, and reasoning alignment accelerates scientific discovery while boosting expert productivity.

    Introduction

    Flowchart depicting the Molecular Universe LLM training pipeline.
    Figure 1. The Molecular Universe LLM training pipeline

    LLMs have demonstrated immense potential in advancing scientific research, enabling tasks like summarizing papers, synthesizing complex insights, and generating novel hypotheses. However, general-purpose LLMs often fall short on domain-specific tasks due to limited exposure to specialized terminology and contextual knowledge during their pretraining.

    To bridge this gap, domain-adapted LLMs offer a more viable solution. Rather than incurring the high cost and compute demands of training from scratch, domain-adaptive pretraining (DAPT) extends an existing foundation model (e.g., Llama) by continuing its pretraining on a curated, domain-relevant corpus.

    This approach significantly enhances performance in specialized fields like science, while preserving the broad language capabilities of the original model. Further, the model is fine-tuned to enhance its ability to respond to general and task-specific queries. While domain adaptation and instruction tuning enhance task performance, they do not equip models with reasoning capabilities. 

    To address this gap, reasoning alignment is introduced, enabling models to logically navigate processes such as hypothesis generation, chain-of-thought reasoning, and self-correction. These capabilities are critical for solving multistep problems and driving material exploration.

    SES AI, a company specializing in battery innovation, built a proprietary model, Molecular Universe LLM, a 70B-parameter custom reasoning model. Based on Llama 3.1 70B, it sets a new benchmark for domain-specific scientific tasks, outperforming other models in its category.

    It showcases a compute-efficient training and alignment strategy that transforms a base model into a high-performing, domain-adapted model, highlighting the effectiveness of combining DAPT, instruction tuning, and reasoning-based fine-tuning for specialized domain-specific tasks.

    Molecular Universe LLM is an AI-driven battery research LLM that leverages advanced reasoning to rank potential electrolyte solvents and additives for synthesis. Previously, SES AI scientists ranked electrolyte solvents and additives manually based on domain expertise, an effort limited to evaluating only a few dozen candidates per day.

    By integrating long-context understanding, structured reasoning, and expert-level decision-making, this approach highlights how domain-adapted reasoning models are accelerating breakthroughs in scientific innovation, significantly augmenting battery experts’ productivity.

    Molecular Universe LLM was trained on NVIDIA DGX Cloud using NVIDIA NeMo Framework through a three-step pipeline: 

    • Step 1: Continuous pretraining on curated scientific literature, processed by NVIDIA NeMo Curator.
    • Step 2: Supervised fine-tuning (SFT) using synthetic data generated by NVIDIA Llama 3.1 70B NIM.
    • Step 3: Post-training on filtered s1K Reasoning Data to align the model for complex scientific reasoning.

    This approach ensures the model delivers domain-specific, contextually relevant, and high-quality responses. By integrating Molecular Universe LLM with data from NVIDIA ALCHEMI GPU-accelerated simulations and molecular maps generated by NVIDIA cuML, SES AI reduces decades of battery research to months.

    Let’s dive into the steps involved in building this model.

    A detailed workflow diagram illustrating the data pipeline and model training process for a scientific language model based on Llama 3.1.
    Figure 2. End-to-end workflow for training the Molecular Universe reasoning model

    Infrastructure setup

    Molecular Universe LLM was trained on 128 NVIDIA H100 GPUs on NVIDIA DGX Cloud, a fully managed AI training platform, co-engineered with leading cloud providers. DGX Cloud includes NVIDIA-managed Kubernetes and NVIDIA Run:ai for workload optimization, job scheduling, and orchestration. Developers can get started with distributed training runs right away on a dedicated cluster, without the complexity of cluster bring-up or managing underlying infrastructure.

    NVIDIA NeMo Framework was used as the AI development platform, on top of NVIDIA DGX Cloud, to deliver a seamless, accelerated experience to efficiently build, customize, and deploy generative AI models at scale. It supports state-of-the-art models and algorithms, while ensuring high training throughput and scalability over thousands of GPUs via 4D parallelism and other optimizations. 

    NVIDIA Run:ai enables administrators to orchestrate GPU capacity through “Projects” and “Departments” to ensure teams are allocated the required shares of capacity for their training workloads. The scheduler also supports workload bursting, which enables workloads to utilize additional capacity when the cluster has extra resources available. This increases GPU utilization while respecting resource allocations, maximizing developer productivity, and minimizing time to value.

    Screenshot of the run:ai Workloads dashboard showing a PyTorchJob named 'mixtral-pretrain' running for 2 days, 20 hours, and 53 minutes.
    Figure 3. Example of GPU usage during a pretraining job running on NVIDIA DGX Cloud

    Step 1: Continuous pretraining

    To establish a robust foundation of domain-specific knowledge in battery research, the Llama 3.1 70B model underwent continuous pretraining. This involved training on a vast, curated corpus of scientific literature, enabling the model to develop a nuanced understanding and specialized expertise essential for accurate, context-aware responses.

    Data curation and processing

    The pretraining corpus consisted of 19M openly available papers from peer-reviewed journals and preprint repositories. Refer to Table 1 for details on the data sources.

    The PDFs from these diverse sources were converted to plain text. Before training, the documents were processed using NeMo Curator, which applied advanced heuristic filtering and GPU-accelerated fuzzy deduplication techniques, including MinHash and Locality Sensitive Hashing. This rigorous pipeline reduced 19M raw samples to 17M unique, high-quality records. NeMo Curator’s preprocessing capabilities were pivotal in eliminating redundancy, filtering out low-quality data, and retaining rich, domain-specific knowledge.

    Data Source | Documents
    Peer-reviewed literature from open sources | ~4M
    arXiv | 1.4M
    ChemRxiv | 26K
    Open Research | 12M
    PubChem | 60K
    Academic textbooks or monographs | 80
    PLOS | 200K
    Table 1. A breakdown of data sources for domain-adaptive pretraining
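
    The fuzzy deduplication step above hinges on MinHash signatures and Locality Sensitive Hashing (LSH). The short Python sketch below illustrates the core idea on toy documents; it is a conceptual illustration only, not NeMo Curator’s API, and the shingle size, number of hash functions, and band layout are arbitrary choices for demonstration.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 128   # number of MinHash permutations (illustrative choice)
BANDS = 32         # LSH bands; 128 / 32 = 4 signature rows per band
SHINGLE_SIZE = 5   # character shingle length (illustrative choice)

def shingles(text, k=SHINGLE_SIZE):
    """Break a document into overlapping character k-grams."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash_signature(text):
    """Approximate the document's shingle set with NUM_HASHES hash minima."""
    doc_shingles = shingles(text)
    return [
        min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in doc_shingles
        )
        for seed in range(NUM_HASHES)
    ]

def candidate_duplicates(docs):
    """Group documents whose signatures collide in at least one LSH band."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash_signature(text)
        for b in range(BANDS):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(doc_id)
    return {frozenset(ids) for ids in buckets.values() if len(ids) > 1}

docs = {
    "paper_a": "Electrolyte additives improve cycle life in lithium metal cells.",
    "paper_b": "Electrolyte additives improve cycle life in lithium-metal cells.",
    "paper_c": "Solid-state separators suppress dendrite growth in thin cells.",
}
print(candidate_duplicates(docs))  # paper_a and paper_b should share a bucket
```

    Near-duplicate documents share most of their shingles, so their MinHash signatures agree in many positions and are likely to collide in at least one band. NeMo Curator applies the same principle at corpus scale on GPUs, which is what allows the 19M raw documents to be reduced to 17M unique records.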

    Model architecture and training details

    The Molecular Universe LLM Base model is built by adapting the pretrained weights of the LLaMA 3.1 70B base model. NeMo Framework was used for continued pretraining of the model, leveraging state-of-the-art optimization techniques including 4D parallelism, mixed precision training, and flash attention. Additionally, NeMo context parallelism played a critical role in enabling the model to handle long sequences of up to 8K tokens without compromising memory efficiency, speed, or stability. 

    The model was trained with an input sequence length of 8,192 tokens and processed 524,288 tokens per forward pass. Training used 128 NVIDIA H100 GPUs, with a total training time of 144 hours in bfloat16 precision. Domain-adaptive pretraining (DAPT) was performed on a fraction of the tokens used in the original pretraining and is significantly more efficient, requiring only about 1.5% of the total pretraining compute.
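
    The snippet below is a simple sanity check that derives the effective global batch size and total GPU-hour budget from the figures quoted above; these are arithmetic consequences of the reported numbers, not additional reported results.

```python
SEQ_LEN = 8_192             # input sequence length, in tokens
TOKENS_PER_STEP = 524_288   # tokens processed per forward pass, as reported
NUM_GPUS = 128              # NVIDIA H100 GPUs
TRAIN_HOURS = 144           # wall-clock training time

sequences_per_step = TOKENS_PER_STEP // SEQ_LEN  # 524,288 / 8,192 = 64 sequences
gpu_hours = NUM_GPUS * TRAIN_HOURS               # 128 * 144 = 18,432 GPU-hours

print(f"Effective global batch size: {sequences_per_step} sequences of {SEQ_LEN:,} tokens")
print(f"Total DAPT budget: {gpu_hours:,} H100 GPU-hours")
```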

    The training and validation loss profiles showed a rapid decline during the initial steps, reflecting swift domain adaptation. Over time, the losses stabilized, indicating effective convergence without signs of overfitting.

    Step 2: Model alignment with supervised fine-tuning 

    To align the Molecular Universe Base model with domain-specific knowledge and improve its instruction-following capabilities, supervised fine-tuning (SFT) was employed. SFT trains the model on labeled examples to improve instruction following and task-specific response generation, especially in domain-specific contexts.

    Data curation and processing

    SES leveraged NVIDIA Llama 3.1 70B NIM for synthetic data generation (SDG) to create a high-quality SFT dataset. They sampled 50,000 papers and generated 200,000 instruction samples across four tasks—question answering, summarization, reading comprehension, and multiple-choice questions—with 160,000 used for training and 40,000 for evaluation. 

    Incorporating 90,000 general chat samples from the Daring-Anteater dataset, the final SFT dataset totaled 250,000 samples, with SDG providing the majority, highlighting the effectiveness of using the NIM to generate domain-specific training data.
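
    As an illustration of this synthetic data generation pattern, the sketch below queries a Llama 3.1 70B NIM endpoint through its OpenAI-compatible API to produce instruction samples from a paper excerpt. The endpoint URL, API key, prompt, and excerpt are placeholders for illustration, not SES AI’s actual pipeline.

```python
from openai import OpenAI

# A NIM deployment exposes an OpenAI-compatible endpoint; the URL, API key,
# and model name below are placeholders for illustration.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

paper_excerpt = (
    "Fluoroethylene carbonate (FEC) as an additive improves the stability "
    "of the solid electrolyte interphase on lithium metal anodes..."
)

prompt = (
    "You are a battery-domain data annotator. From the excerpt below, write one "
    "question-answer pair, one summary, one reading-comprehension question, and "
    "one multiple-choice question with four options and the correct answer.\n\n"
    f"Excerpt:\n{paper_excerpt}"
)

response = client.chat.completions.create(
    model="meta/llama-3.1-70b-instruct",  # model name as served by the NIM
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```

    Looping a generator like this over sampled papers, then splitting the outputs by task type, is one straightforward way to assemble an SFT dataset of the scale described above.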

    Model architecture and training details

    This dataset was tokenized using the Llama 3.1 70B tokenizer, and the model was then fine-tuned with SFT on a multi-node system using the NeMo Framework, resulting in the final Molecular Universe Chat model. Training ran on DGX Cloud with 128 NVIDIA H100 GPUs and NVIDIA Run:ai software, completing in just 32 hours.

    The training and validation loss profiles exhibited a rapid initial decline, stabilizing around 400 steps. A slight increase in training loss after 600 steps suggests potential overfitting or sensitivity to the learning rate. However, the validation loss remained stable, indicating strong generalization performance.

    Step 3: Post-training with high-quality reasoning data

    While domain-adaptive pretraining on scientific literature and instruction-based fine-tuning enhance the model’s ability to address general and domain-specific questions, they do not by themselves equip it to solve complex scientific problems that require multi-step reasoning.

    To overcome this, the Molecular Universe Chat model was fine-tuned on a curated sample set (~25,000 samples) from s1K Reasoning Data, a dataset of high-quality, difficult questions with reasoning traces and solutions distilled from Gemini Thinking. The s1K dataset was filtered to remove low-quality samples with formatting issues and questions easily answered by base models such as Qwen2.5 7B Instruct and Qwen2.5 32B Instruct.

    Additionally, an LLM was used to cluster samples into thematic categories (e.g., math, science), and uniform sampling was applied with a bias towards examples containing longer reasoning traces to better capture task complexity. The resulting data samples were further decontaminated to remove task-specific benchmarks such as GPQA Diamond. 
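
    A minimal sketch of the category-balanced, length-biased sampling step described above is shown below. The record fields and weighting scheme are illustrative assumptions, not the exact filtering code used.

```python
import random
from collections import defaultdict

# Toy records standing in for the filtered s1K-style data; field names are assumptions.
records = [
    {"category": "science", "question": "...", "reasoning_trace": "step " * 400},
    {"category": "science", "question": "...", "reasoning_trace": "step " * 80},
    {"category": "math",    "question": "...", "reasoning_trace": "step " * 600},
    {"category": "math",    "question": "...", "reasoning_trace": "step " * 50},
]

def sample_per_category(records, n_per_category, seed=0):
    """Sample uniformly across categories, weighting longer reasoning traces more heavily."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for r in records:
        by_cat[r["category"]].append(r)

    selected = []
    for cat, items in by_cat.items():
        weights = [len(r["reasoning_trace"].split()) for r in items]  # longer trace => higher weight
        k = min(n_per_category, len(items))
        pool, pool_w = list(items), list(weights)
        # Weighted sampling without replacement via repeated weighted draws.
        for _ in range(k):
            pick = rng.choices(range(len(pool)), weights=pool_w, k=1)[0]
            selected.append(pool.pop(pick))
            pool_w.pop(pick)
    return selected

subset = sample_per_category(records, n_per_category=1)
print([(r["category"], len(r["reasoning_trace"].split())) for r in subset])
```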

    Post-training supervised fine-tuning was done using the NeMo Framework, with the context length increased to 16K to accommodate the reasoning traces. This step took about 12 hours on 64 H100 GPUs for five epochs of training. It not only improved factual accuracy but also enhanced the model’s ability to reason through complex ideas, yielding a score of 0.72 on GPQA Diamond.

    Results 

    The Molecular Universe Chat and Reasoning models were evaluated on science-focused public benchmarks such as GPQA Diamond and on custom domain-specific benchmarks. The Reasoning model achieved a score of 0.72 on GPQA Diamond, surpassing most other notable open-source models of similar or even larger size, such as DeepSeek-R1.

    The Molecular Universe Reasoning model also performed better than Llama 3.1 70B on public benchmarks such as MMLU, Winogrande, HellaSwag, and ARC-E. This significant performance gain over the base starting model highlights the value of continued domain pretraining and reasoning-driven post-training in advancing model capabilities beyond instruction alignment alone.

    A bar chart titled "Performance comparison between different SOTA reasoning models on the GPQA."
    Figure 4. Performance comparison between different SOTA reasoning models on the GPQA
    Model | # of Parameters | Battery Q/A | Battery MCQ | Battery RC | Battery Summarization | Battery Reasoning
    GPT-o1 | – | 96% | 92% | 90% | 88% | 84%
    Molecular Universe Reasoning | 70B | 96% | 89% | 90% | 86% | 82%
    Claude 3.7 Sonnet | – | 94% | 86% | 89% | 86% | 80%
    Gemini Flash Thinking | – | 92% | 85% | 88% | 82% | 79%
    Molecular Universe Chat | 70B | 93% | 79% | 84% | 79% | 73%
    LLaMA 3.1 | 70B | 71% | 67% | 78% | 75% | 66%
    Table 2. Performance comparison on battery-specific tasks, including Q/A, MCQ, reading comprehension, summarization, and reasoning

    The Molecular Universe Chat and Reasoning models were further evaluated on the 40,000-sample SFT held-out test set and a custom battery-specific reasoning benchmark. The models were compared against GPT-o1, Llama 3.1 70B, Claude 3.7 Sonnet, and Gemini Flash Thinking.

    Across tasks like Q&A, MCQ, reading comprehension, summarization, and reasoning, Molecular Universe Reasoning LLM consistently outperformed all baselines except GPT-o1. Despite GPT-o1’s lead, partly due to its role in generating fine-tuning data, Molecular Universe Reasoning delivered competitive results with far fewer parameters and lower training costs, further emphasizing the impact of domain adaptation and reasoning alignment.

    Conclusion and future work 

    Molecular Universe Reasoning, a 70B parameter scientific reasoning LLM, demonstrates state-of-the-art performance on scientific tasks within its size class. A compute-efficient training strategy, combining domain-adaptive pretraining with reasoning-based supervised fine-tuning, significantly improved performance over the base model, with minimal additional computational cost. 

    The combined use of both techniques proved valuable, outperforming either method in isolation and achieving results competitive with much larger models on both general and battery-specific benchmarks. The Molecular Universe Reasoning model has been deployed using NIM microservices support for fine-tuned models, enabling scalable, real-time serving that allows end users to send many concurrent requests. Molecular Universe LLM will be integrated into SES AI’s materials discovery platform, Molecular Universe (MU-0), a unified software and service solution designed to help battery researchers and industry professionals explore a vast database of candidate small molecules through a single, consolidated search interface.
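
    To illustrate the concurrent-request serving pattern enabled by NIM, the sketch below sends several prompts in parallel through the endpoint’s OpenAI-compatible API using asyncio. The endpoint URL and model identifier are placeholders, not the deployed service’s actual details.

```python
import asyncio
from openai import AsyncOpenAI

# Placeholder endpoint and model id for a fine-tuned model served with NIM.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

PROMPTS = [
    "Rank fluoroethylene carbonate and vinylene carbonate as SEI-forming additives.",
    "Summarize the trade-offs of high-concentration electrolytes for lithium metal anodes.",
    "Which solvent properties matter most for low-temperature performance?",
]

async def ask(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="ses-ai/molecular-universe-reasoning",  # placeholder model id
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    return response.choices[0].message.content

async def main() -> None:
    # Fire all requests concurrently; the serving stack handles them in parallel.
    answers = await asyncio.gather(*(ask(p) for p in PROMPTS))
    for prompt, answer in zip(PROMPTS, answers):
        print(prompt, "->", answer[:80], "...")

asyncio.run(main())
```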

    Future efforts will involve refining the model through domain-specific reasoning post-training, particularly by constructing a specialized battery-focused dataset to enhance task-relevant reasoning, and exploring reinforcement learning with human feedback to further improve domain-specific performance. This work illustrates a path toward developing cost-efficient, mid-sized (<100B parameter) domain-expert models with strong specialization across different domains.

    To learn more about the NeMo Framework on NVIDIA DGX Cloud, visit the official NVIDIA docs and GitHub. Get started with NVIDIA DGX Cloud today. Discover NVIDIA ALCHEMI and explore NVIDIA cuML for advanced machine learning solutions.

    Thanks to Zihan Wang (NVIDIA) and Kang Xu (SES) for their valuable support and insights.
