Curating high-quality pretraining datasets is critical for enterprise developers aiming to train state-of-the-art large language models (LLMs). To enable developers to build highly accurate LLMs, NVIDIA previously released Nemotron-CC, a 6.3-trillion-token English language Common Crawl (CC) dataset. Today, the NVIDIA NeMo Curator team is excited to share that the pipeline used to build the Nemotron-CC dataset has now been merged into the NeMo Curator GitHub repository.
The Nemotron-CC pipeline presents a novel solution for balancing the trade-off between accuracy and data quantity at scale. Using a combination of classifier ensembling and synthetic data rephrasing, it offers a scalable method to generate high-quality synthetic data from the original dataset to expand it.
Benefits of Nemotron-CC data curation pipeline
Traditional data curation methods rely heavily on heuristic filtering, which cannot assess semantic quality and therefore discards low-quality text that could otherwise be repurposed. As a result, models trained on such suboptimal data hit accuracy plateaus, especially on complex reasoning benchmarks like MMLU.
The Nemotron-CC pipeline, now integrated into NeMo Curator, closes this gap. Rather than discarding lower-quality text outright, it uses an ensemble of quality classifiers to rank documents more reliably and synthetic data rephrasing to repurpose and expand the corpus.
Using this pipeline, you can generate 2 trillion tokens of high-quality synthetic data to expand the final dataset, a first-of-its-kind recipe in the open-source domain. This synthetic data restores much of the content eliminated by filtering: heuristic pipelines discard up to 90 percent of the crawled data in datasets like DCLM and FineWebEdu, leaving too few unique tokens for long-horizon pretraining runs, such as the 15 trillion tokens used to train models like Llama 3.1.
Deep dive: Nemotron-CC data curation pipeline
As shown in Figure 1, the Nemotron-CC pipeline addresses the gaps in traditional data curation by combining perplexity scoring, ensemble quality labeling, and synthetic data generation to transform a raw Common Crawl (CC) dataset into a refined, multitask-ready corpus.
*Figure 1. Overview of the Nemotron-CC data curation pipeline*
The data curation pipeline ingests the raw CC dataset and applies a series of filters and processing steps. At a high level, it consists of three stages:
- HTML-to-Text Extraction and Filtering
- Model-based Quality Labeling
- Synthetic Data Generation (SDG)
Let’s dive deeper into each of these stages in the pipeline.
HTML-to-text extractor and filter
The pipeline uses jusText for HTML-to-text extraction, FastText for identifying English-language documents, and Unicode normalization to clean up malformed characters, as sketched below.
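The snippet below is a minimal, illustrative sketch of this stage using the off-the-shelf `justext` and `fasttext` Python packages; it is not the production implementation, and the `lid.176.bin` language-ID model path is an assumption (the model is downloaded separately from the fastText website).

```python
import justext
import fasttext  # pip install fasttext justext

# Hypothetical path; download lid.176.bin from the fastText website.
lid_model = fasttext.load_model("lid.176.bin")

def extract_english_text(html: bytes) -> str | None:
    # jusText drops boilerplate (menus, footers) and keeps main content.
    paragraphs = justext.justext(html, justext.get_stoplist("English"))
    text = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
    if not text:
        return None
    # fastText language ID; keep documents confidently identified as English.
    labels, probs = lid_model.predict(text.replace("\n", " "))
    if labels[0] == "__label__en" and probs[0] > 0.8:
        return text
    return None
```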
After the text is extracted, we apply exact and fuzzy deduplication algorithms to remove duplicate and near-duplicate documents. NeMo Curator’s exact deduplication module efficiently identifies and removes identical documents by hashing each document and retaining only one document per hash. The fuzzy deduplication module, in turn, identifies and removes near-duplicate documents by computing MinHash signatures and employing locality-sensitive hashing (LSH) to detect documents with high Jaccard similarity scores.
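To make the two ideas concrete, here is a small, self-contained sketch of both techniques in plain Python. This is illustrative only, not NeMo Curator’s GPU-accelerated implementation, and the shingle size and signature length are arbitrary example values.

```python
import hashlib
import random

def exact_dedup(docs: list[str]) -> list[str]:
    """Keep only the first document seen for each exact content hash."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def _hash_shingle(salt: int, shingle: str) -> int:
    digest = hashlib.md5(f"{salt}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little")

def minhash_signature(doc: str, num_hashes: int = 128) -> list[int]:
    """MinHash signature over word 5-gram shingles. Signatures of two
    documents agree on roughly a Jaccard-similarity fraction of their
    positions; LSH then bands signatures into buckets so only likely
    near-duplicates are ever compared."""
    words = doc.split()
    shingles = {" ".join(words[i:i + 5]) for i in range(max(1, len(words) - 4))}
    salts = [random.Random(i).getrandbits(32) for i in range(num_hashes)]
    return [min(_hash_shingle(salt, s) for s in shingles) for salt in salts]
```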
NeMo Curator deduplication modules leverage NVIDIA RAPIDS libraries like cuDF, cuML, and cuGraph along with Dask to scale workloads across multi-node, multi-GPU environments, significantly reducing data processing time. With NeMo Curator, you can achieve 16X faster processing for text when compared to alternatives.
We then applied 28 distinct heuristic filters, covering non-alphanumeric content, numerical and URL ratios, whitespace inconsistencies, and duplicate patterns detected via various n-gram analyses, to ensure that only the highest-quality, most relevant data is retained. NeMo Curator simplifies this process by accepting a configuration file that lists the required filters and automatically building the filtering pipeline for all documents in the dataset, with optional GPU acceleration for enhanced performance.
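As an illustration of what one of these heuristics looks like, the sketch below implements a symbol-ratio filter in NeMo Curator’s `DocumentFilter` style. NeMo Curator ships its own implementations of these heuristics; this standalone toy version only shows the pattern, and the 0.25 threshold is an arbitrary example, not the value used in the pipeline.

```python
from nemo_curator.filters import DocumentFilter

class SymbolRatioFilter(DocumentFilter):
    """Discard documents dominated by symbols rather than words."""

    def __init__(self, max_non_alnum_ratio: float = 0.25):  # example threshold
        self._max_ratio = max_non_alnum_ratio

    def score_document(self, text: str) -> float:
        # Fraction of characters that are neither alphanumeric nor whitespace
        if not text:
            return 1.0
        non_alnum = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
        return non_alnum / len(text)

    def keep_document(self, score: float) -> bool:
        return score <= self._max_ratio
```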
Finally, as shown in the code block below, we used pre-trained KenLM models to create a `PerplexityFilter` module, generate perplexity scores, and filter documents based on a perplexity threshold. A lower perplexity score means the text is predictable and fluent, while a higher perplexity score means the text is strange or incoherent.
```python
from nemo_curator.filters import DocumentFilter

# KenlmModel wraps a pretrained KenLM n-gram language model;
# models_dir points to the directory containing the model files.
class PerplexityFilter(DocumentFilter):
    def __init__(self, threshold):
        self._kenlm_model = KenlmModel(model_path=models_dir, language="en")
        self._threshold = threshold

    def score_document(self, text: str):
        # Normalized perplexity of the document under the KenLM model
        return self._kenlm_model.get_perplexity(text, normalize=True)

    def keep_document(self, score: int):
        # Keep only documents at or below the perplexity threshold
        return score <= self._threshold
```
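Applying the filter to a dataset might look like the following sketch, which wires the `PerplexityFilter` into a pipeline with NeMo Curator’s `ScoreFilter`. The threshold of 300 and the input and output paths are illustrative assumptions, not the values used for Nemotron-CC.

```python
from nemo_curator import ScoreFilter
from nemo_curator.datasets import DocumentDataset

# Hypothetical paths; threshold of 300 is an example value only.
dataset = DocumentDataset.read_json("./extracted_docs/")
perplexity_step = ScoreFilter(PerplexityFilter(threshold=300), text_field="text")
filtered_dataset = perplexity_step(dataset)
filtered_dataset.to_json("./perplexity_filtered/")
```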
Model-based quality labeling
We used an ensemble of three quality classifier models: the FastText quality classifier, plus NeMo Curator’s FineWeb Mixtral Edu classifier and FineWeb Nemotron-4 Edu classifier.
Each of the three classifiers was trained to score quality according to different preferences. We ran them on every document and ensembled their scores to rank the entire corpus and bucket it. These buckets are then grouped into five quality levels so each can be handled appropriately in the subsequent synthetic data generation step.
Each classifier model generates a floating-point score, which is mapped to an integer category ranging from 0 (worst quality) to 19 (best quality) using percentile-based thresholds, creating 20 bins for easier comparison. The predictions from the classifiers are then combined into a single representative ensemble score by taking the maximum of the integer scores across all classifiers.
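Here is a minimal NumPy sketch of this binning-and-max ensembling logic. It is illustrative only; the pipeline computes these scores at scale on GPUs.

```python
import numpy as np

def ensemble_scores(classifier_scores: list[np.ndarray]) -> np.ndarray:
    """Map each classifier's float scores to 20 percentile-based bins,
    then take the per-document maximum across classifiers."""
    binned = []
    for scores in classifier_scores:
        # 19 interior percentile thresholds yield integer bins 0..19
        thresholds = np.percentile(scores, np.linspace(5, 95, 19))
        binned.append(np.digitize(scores, thresholds))
    return np.maximum.reduce(binned)

# Example: three classifiers scoring the same five documents
rng = np.random.default_rng(0)
scores = [rng.random(5) for _ in range(3)]
print(ensemble_scores(scores))  # integer ensemble scores in [0, 19]
```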
These quality classifier models are already supported by NeMo Curator, so we simply initialized them, applied them to our input dataset, and generated an output dataset with additional score columns for each classifier. Moreover, the entire process benefited from GPU acceleration out of the box, as shown in the code block below.
```python
# Import required classifiers
from nemo_curator.classifiers import (
    FineWebNemotronEduClassifier,
    FineWebMixtralEduClassifier,
    FastTextQualityClassifier,
)
from nemo_curator import get_client
from nemo_curator.datasets import DocumentDataset

# Start distributed client on GPU
client = get_client(cluster_type="gpu")

# Initialize all classifiers
classifiers = [
    FineWebNemotronEduClassifier(...),
    FineWebMixtralEduClassifier(...),
    FastTextQualityClassifier(...),
]

# Read input dataset with cuDF
input_dataset = DocumentDataset.read_parquet("./data.parquet", backend="cudf")

# Apply classifiers, adding a score column per classifier
output_dataset = input_dataset
for classifier in classifiers:
    output_dataset = classifier(dataset=output_dataset)

# Save output dataset
output_dataset.to_parquet(path=quality_classification_results_dir)

# Stop Dask client
client.cluster.close()
```
Synthetic data generation
We use SDG pipelines to generate data from both low- and high-quality documents. For low-quality documents, we repurpose the useful information by prompting an LLM to rewrite the text using a Wikipedia-style prompt.
For high-quality data, we generate more unique pretraining tokens by rephrasing or condensing the essential knowledge in the text. To achieve this, we ran four different LLMs with different prompts to generate the following (a minimal prompting sketch appears after the list):
- Diverse question-answer (QA) pairs: Ask questions in various forms, such as yes/no, multiple-choice, and open-ended questions
- Distill: Rewrite the text into a concise and clear passage
- Extract knowledge: Rewrite knowledge from the text and disregard uninformative content
- Knowledge list: Extract key information from the text as an organized list
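As a rough illustration of the rephrasing step, the sketch below sends a Wikipedia-style rewrite prompt to an OpenAI-compatible endpoint. The endpoint URL, model name, and prompt wording are all illustrative assumptions, not the exact ones used to build Nemotron-CC.

```python
from openai import OpenAI

# Hypothetical OpenAI-compatible endpoint (e.g., a locally hosted NIM)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Illustrative prompt; the production prompts differ per task
# (QA pairs, distill, extract knowledge, knowledge list).
WIKIPEDIA_STYLE_PROMPT = (
    "Rewrite the following text in the style of a Wikipedia article. "
    "Preserve all factual information and remove noise:\n\n{document}"
)

def rephrase(document: str, model: str = "meta/llama-3.1-8b-instruct") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user",
             "content": WIKIPEDIA_STYLE_PROMPT.format(document=document)},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content
```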
Results
As shown in Table 1, a Llama 3.1 8B-parameter model trained on a 1T-token subset of the Nemotron-CC dataset improves its MMLU score by an impressive 5.6 points over the same model trained on the DCLM dataset. For the full benchmark results, refer to the paper Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset.
| Dataset | MMLU |
| --- | --- |
| FineWebEdu-2 | 42.4 |
| FineWebEdu | 42.9 |
| DCLM | 53.4 |
| Nemotron-CC-HighQuality | 59.0 |

*Table 1. MMLU scores for 8B-parameter models trained on 1T tokens from each dataset*
As shown in Table 2, training the Llama 3.1 8B model on a long-horizon budget of 15T tokens, including 7.2T from the Nemotron-CC dataset, led to a 5-point boost on the MMLU benchmark, scoring 70.3 compared to 65.3 for Llama 3.1.
| Model | MMLU |
| --- | --- |
| Llama 3.1 | 65.3 |
| Llama 3.1 trained with Nemotron-CC data | 70.3 |

*Table 2. MMLU scores for Llama 3.1 8B trained on 15T tokens, with and without Nemotron-CC data*
Get started
Use the Nemotron-CC pipeline to generate high-quality datasets for pretraining a foundation model or for domain-adaptive pretraining (DAPT) across domains such as energy, manufacturing, and chemistry. Thanks to NeMo Curator’s flexibility, the individual components of this pipeline can be used to build not just pretraining datasets but also fine-tuning datasets.
Check out the following links to get started:
- GitHub tutorial: A step-by-step Jupyter notebook for deploying the pipeline.
- NeMo Curator APIs: A Pythonic interface for customizing various stages of the pipeline, for example, swapping classifiers.
Also, don’t forget to star the NeMo Curator GitHub repository to receive regular updates on newly released features and tutorials, and to contribute your code to the repository.