Zyphra is excited to release Zyda-2, a 5-trillion-token dataset composed of filtered and cross-deduplicated DCLM, FineWeb-Edu, Zyda-1, and the Common Crawl portion of Dolma v1.7. Leveraging NVIDIA NeMo Curator, we dramatically accelerated data processing from 3 weeks to 2 days while reducing costs. Zyda-2 powers our Zamba2 series of models, pushing the boundaries of small-LLM performance and reinforcing Zyphra's position at the forefront of efficient, high-performance language model development.
At Zyphra, one of our key priorities is producing the highest-quality and most efficient models for a given parameter budget, with a special focus on small, highly powerful models which can automate many tasks extremely cheaply and be utilized on consumer and edge devices. This has driven our innovations in model architecture and also necessitates a focus on constructing strong pretraining and annealing datasets.
Our general approach is to first collect strong open-source datasets (like Dolma, FineWeb, and DCLM), then remove duplicates and aggressively filter them to obtain the highest-quality subset that fits our training budget. Our first dataset, Zyda-1, was used to train Zamba-7B: a 1.3T-token dataset that outperformed all major open language modeling datasets available at the time, such as Dolma-1.6, FineWeb, and RefinedWeb. For that dataset we built our own data processing pipeline, open-sourced as the Zyda-processing library, which relies entirely on CPU compute for both the deduplication and filtering steps.
For our next round of datasets we aimed to increase both the scale and the quality of the data well beyond Zyda-1, enabling us to reach the frontier of small-LLM performance with our Zamba2 series of models. We realized that our previous hand-built, CPU-based data processing pipeline would not be sufficient, and turned to NVIDIA NeMo Curator to speed up development, lower costs, and significantly decrease the runtime of our dataset processing.
NeMo Curator is a GPU-accelerated data curation library that improves generative AI model performance by processing large-scale, high-quality datasets for pretraining and customization.
By providing ready-made scripts for filtering and deduplication, as well as its own suite of topic models, NeMo Curator enables rapid prototyping and development of dataset processing workflows.
Additionally, thanks to its deep integration with Dask and its optimization for GPU processing, NeMo Curator makes it much more straightforward to parallelize jobs and to run certain data processing stages on GPUs, leading to a substantial speedup: from 3 weeks on CPU to 2 days on GPU.
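To make the parallelization pattern concrete, here is a minimal, hypothetical sketch of the Dask idiom that NeMo Curator builds on: a partitioned dataframe plus partition-wise processing stages that Dask schedules across workers. This is not NeMo Curator's own API, and the paths and the toy length filter are placeholders.

```python
# Illustrative Dask pattern (not NeMo Curator's API): treat the corpus as a
# partitioned dataframe and apply each processing stage partition by partition.
# NeMo Curator backs the same pattern with GPU (cuDF) dataframes for speed.
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

def drop_short_documents(df):
    # Toy per-partition stage: remove documents under 200 characters.
    return df[df["text"].str.len() >= 200]

if __name__ == "__main__":
    client = Client(LocalCluster(n_workers=8))        # one worker per core/GPU
    docs = dd.read_parquet("corpus/raw/*.parquet")    # placeholder input path
    cleaned = docs.map_partitions(drop_short_documents)
    cleaned.to_parquet("corpus/cleaned/")             # placeholder output path
```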
Zyda-2 is an approximately 5-trillion-token, permissively licensed dataset available on Hugging Face. Models trained on it perform extremely strongly on downstream standardized evaluation tasks, beating current state-of-the-art datasets such as DCLM and FineWeb-Edu. Our leading Zamba2 series of models was trained on an early version of this dataset.
Zyda-2 comprises the majority of open-source high-quality tokens available online. Specifically, it consists of DCLM, FineWeb-Edu, Zyda-1, and the Common-Crawl portion of Dolma v1.7.
DCLM and FineWeb-Edu have already undergone significant quality filtering as a core component of their creation and so perform extremely strongly on many benchmark tasks. Zyda-1 and Dolma-CC are less heavily filtered and contain a wider variety of internet documents, not all of which are designed to be educational. While these datasets lead to worse performance in small-model ablations, we believe that greater dataset diversity is extremely important for larger models, and indeed combining the Zyda-1 and Dolma-CC portions with FineWeb-Edu and DCLM leads to increased downstream evaluation performance.
We perform cross-deduplication between all datasets. We believe this increases quality per token, since it removes duplicated documents. Following that, we perform model-based quality filtering on Zyda-1 and Dolma-CC using NeMo Curator's quality-classifier-deberta, keeping only the 'high-quality' subset of these datasets.
The combination of all these datasets improves overall performance, so that Zyda-2 outperforms each of its individual components. We believe this is due both to our cross-deduplication efforts and to an ensembling effect: combining a variety of data sources obtained through different processing pipelines yields more diverse data, which improves model performance and robustness.
To create Zyda-2 we first obtained our starting data sources, DCLM, FineWeb-Edu-score-2 (FineWeb-Edu2 for short), Zyda-1, and the Common Crawl portion of Dolma v1.7 (Dolma-CC for short), and then put them through a two-stage pipeline (see Fig. 3). We chose the FineWeb-Edu2 version of FineWeb-Edu to start from a much bigger pool of documents. The first stage was cross-deduplication applied to all datasets, followed by a model-based filtering stage applied to Zyda-1 and Dolma-CC.
As the final step we filtered FineWeb-Edu2 by its educational score, essentially converting it into FineWeb-Edu, the recommended version of the dataset. We elaborate on the reasons below.
This whole process results in the best possible version of each subset, which together form the final 5T-token dataset.
Our first phase was to perform deduplication across our component datasets since, given that they all ultimately originate from similar Common Crawl scrapes, they may contain significant fractions of duplicated documents. Deduplication has generally been found to improve language modeling performance, although recent papers have claimed its effects are neutral or even negative, which we discuss later.
As expected, we found a significant number of duplicated documents across the datasets; removing them cut approximately 32% of the total tokens relative to the source datasets. We used the approximate MinHash LSH deduplication provided in NeMo Curator with the following parameters: MinHash signatures of size 128 computed on character-based 25-grams, split into 8 bands, giving a Jaccard similarity threshold of roughly 85%. We then constructed an undirected graph with documents as nodes and duplicate pairs as edges, and found its connected components, which gave us clusters of duplicates.
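For illustration, the recipe above can be sketched with off-the-shelf CPU libraries (datasketch for MinHash LSH, networkx for connected components) rather than NeMo Curator's GPU implementation; the tiny in-memory corpus and the exact 0.85 threshold below are placeholders.

```python
# Minimal sketch of the fuzzy-dedup recipe described above, using datasketch
# and networkx instead of NeMo Curator's GPU implementation.
from datasketch import MinHash, MinHashLSH
import networkx as nx

NUM_PERM = 128   # signature size
NGRAM = 25       # character-based 25-grams

def char_ngrams(text, n=NGRAM):
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash(text):
    m = MinHash(num_perm=NUM_PERM)
    for g in char_ngrams(text):
        m.update(g.encode("utf-8"))
    return m

documents = {"doc_a": "...", "doc_b": "...", "doc_c": "..."}  # placeholder corpus

# LSH index; a ~0.85 Jaccard threshold roughly corresponds to splitting the
# 128-wide signatures into 8 bands of 16 rows, as in the pipeline above.
lsh = MinHashLSH(threshold=0.85, num_perm=NUM_PERM)
signatures = {doc_id: minhash(text) for doc_id, text in documents.items()}
for doc_id, sig in signatures.items():
    lsh.insert(doc_id, sig)

# Undirected graph whose edges are candidate duplicate pairs; connected
# components are the clusters of duplicates.
graph = nx.Graph()
graph.add_nodes_from(documents)
for doc_id, sig in signatures.items():
    for match in lsh.query(sig):
        if match != doc_id:
            graph.add_edge(doc_id, match)

duplicate_clusters = [c for c in nx.connected_components(graph) if len(c) > 1]
```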
Duplicates in FineWeb-Edu2 and DCLM

While performing our deduplication analyses, we also performed a full intra-dataset deduplication of both DCLM and FineWeb-Edu. We found that both datasets contain a very large number of internal duplicates. For FineWeb-Edu this had been noticed previously, but the same is true of DCLM, where we believe we have found approximately 80% duplicates. Both the DCLM and FineWeb-Edu papers report that internal deduplication did not improve their results, and so neither pursued it nor released a deduplicated version of their datasets.
These results raise several interesting questions. First, it is unclear why removing duplicates does not improve performance: it implies that duplicated documents provide approximately the same benefit as fresh tokens from the same distribution. We can perhaps think of a highly duplicated dataset like FineWeb-Edu as equivalent to performing a 2-5 epoch shuffled run over a much smaller deduplicated dataset, which would imply that a small number of epochs does not particularly harm language model performance at these scales. However, this effect may not hold at larger scales, where language models can memorize the training data much more sample-efficiently and thus require more regularization to counter overfitting on multi-epoch text data.
If this is true, however, it is unclear why earlier results showed positive effects of deduplication. One hypothesis is that the earlier datasets on which deduplication was found to help were not strongly filtered by model-based classifiers, unlike DCLM and FineWeb-Edu, so the deduplication step may simply have been removing many of the low-quality spam documents that occur frequently in unfiltered web data; that is, deduplication may have been acting as an implicit quality filter. In more stringently filtered datasets, the magnitude of this effect diminishes and disappears into the evaluation noise. We do note that the most frequent duplicate in DCLM is indeed a spam document that appears to be low quality. Additionally, we notice that on supposedly 'duplicate' data the scores assigned by the quality model often vary significantly despite only very minor differences in text, which may indicate a relative lack of robustness in some of these classifiers, or the presence of a significant fraction of false positives in the dedup step.
To be cautious, we decided not to change the deduplication strategy of DCLM and FineWeb-Edu at this point, and limited ourselves to cross-deduplication between the datasets; that is, we only considered clusters containing documents from different sources. For every such cluster we ranked documents by source in the order FineWeb-Edu2 -> DCLM -> Zyda-1 -> Dolma-CC, preserved the documents from the highest-ranking source, and removed the rest.
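A minimal sketch of this cluster-resolution rule, assuming each duplicate cluster is available as a list of (document ID, source) pairs; the example cluster is made up.

```python
# Keep the copies from the highest-ranked source in a cross-source duplicate
# cluster and drop the rest. Intra-dataset clusters are left untouched here.
SOURCE_RANK = {"fineweb-edu2": 0, "dclm": 1, "zyda-1": 2, "dolma-cc": 3}

def resolve_cluster(cluster):
    """cluster: list of (doc_id, source) pairs forming one duplicate cluster."""
    sources = {source for _, source in cluster}
    if len(sources) < 2:
        return cluster, []                      # single-source cluster: keep as-is
    best = min(sources, key=SOURCE_RANK.__getitem__)
    keep = [d for d in cluster if d[1] == best]
    drop = [d for d in cluster if d[1] != best]
    return keep, drop

keep, drop = resolve_cluster(
    [("doc_17", "dclm"), ("doc_93", "dolma-cc"), ("doc_55", "zyda-1")]
)
# keep -> the DCLM copy; drop -> the Zyda-1 and Dolma-CC copies
```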
Since Zyda-1 and Dolma-CC, unlike FineWeb-Edu and DCLM, had not been processed by model-based filtering, we decided to experiment with the quality-classifier-deberta model provided in NeMo Curator. We applied this filter to Zyda-1 and Dolma-CC and experimented with either removing only the 'low'-quality documents or keeping only the 'high'-quality ones. We found that keeping only the high-quality documents (10-20% of the total) significantly improved model performance.
This is likely because, since the original datasets are not filtered by model-based classifiers (although Zyda-1 has significant syntactic filtering), both contain large numbers of web documents that do not particularly help with downstream language modeling tasks.
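As a rough illustration of the 'keep only high-quality' filter, here is a sketch using the Hugging Face transformers pipeline; the model ID nvidia/quality-classifier-deberta and the 'High' label name are assumptions based on the classifier's model card, and NeMo Curator wraps the same model behind its own distributed API.

```python
# Sketch of keeping only documents the quality classifier labels as high quality.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="nvidia/quality-classifier-deberta",  # assumed Hub model ID
)

docs = [
    "An in-depth tutorial on solving ordinary differential equations...",
    "click here for cheap deals!!!",
]  # placeholder documents

predictions = classifier(docs, truncation=True)
high_quality = [
    doc for doc, pred in zip(docs, predictions)
    if pred["label"] == "High"                  # assumed label name
]
```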
As an additional step, we filtered FineWeb-Edu2 by its educational score, essentially converting it into FineWeb-Edu: a higher-quality subset of FineWeb-Edu2 comprising samples with an educational score of 3 or higher, as rated by the FineWeb-Edu quality classifier model. This threshold yielded the best performance in the ablations conducted during the creation of FineWeb-Edu. Notably, we applied this filtering after cross-deduplicating DCLM against FineWeb-Edu2, which eliminated all samples in DCLM deemed duplicates of FineWeb-Edu2 samples. This sequence of steps indirectly applied the FineWeb quality classifier to DCLM, effectively removing the DCLM samples with lower educational content.
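A minimal sketch of this score-based filter using the datasets library; the repository ID and the int_score column name are assumptions based on the public FineWeb-Edu release.

```python
# Sketch of converting FineWeb-Edu2 into a FineWeb-Edu-like subset by keeping
# documents with an educational score of 3 or above.
from datasets import load_dataset

fwe2 = load_dataset(
    "HuggingFaceFW/fineweb-edu-score-2",   # assumed repository ID
    split="train",
    streaming=True,
)
fineweb_edu_like = fwe2.filter(lambda ex: ex["int_score"] >= 3)  # assumed column
```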
Evaluating the quality of datasets is challenging without training large-scale models on them, which costs significant compute and is infeasible for every possible dataset ablation. One typical approach to getting signal on dataset ablations is to train small models on small slices of the data (for instance, a 300-million-parameter model on 100 billion tokens) and then compare the evaluation performance of the resulting models. This is the approach taken by the vast majority of recent papers and dataset releases. Its advantage is that it is computationally feasible to run many such experiments even on relatively constrained resources, but for such small models many standard evaluations show no strong signal (for instance, MMLU remains at chance), and even when there is signal it is often noisy.
An alternative is to utilize the increasingly popular annealing approach to training LLMs. It has been found that training an LLM on a large amount of data with a relatively slow learning-rate decay, followed by a rapid learning-rate decay over a high-quality dataset, can significantly improve performance. Many recent models have reported using such an annealing strategy, although details are often sparse.
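A toy sketch of such a two-phase schedule, with a long, slowly decaying main phase followed by a short, rapid anneal; all constants are illustrative and not the values used for Zamba2.

```python
# Two-phase learning-rate schedule: slow decay for most of training, then a
# rapid linear decay to zero over the final (annealing) fraction of steps.
def learning_rate(step, total_steps, anneal_frac=0.1,
                  peak_lr=3e-4, main_floor=1e-4):
    anneal_start = int(total_steps * (1 - anneal_frac))
    if step < anneal_start:
        # Main phase: slow linear decay from peak_lr down to main_floor.
        frac = step / anneal_start
        return peak_lr - (peak_lr - main_floor) * frac
    # Annealing phase: rapid linear decay from main_floor to zero,
    # run over the high-quality dataset.
    frac = (step - anneal_start) / (total_steps - anneal_start)
    return main_floor * (1 - frac)
```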
Blakeney et al. propose utilizing the annealing phase as a method to test the performance of different datasets. While not requiring significantly more compute than training a small model from scratch, annealing makes it possible to use much larger, already-trained base models, which show signal on important evaluation metrics such as MMLU.
We validate the results of this paper: we see significantly less signal on our dataset ablations when using small models than when using annealing. We perform our annealing ablations from a pre-annealing checkpoint of our Zamba2-2.7B model, annealing over 40 billion tokens. Since Zamba2-2.7B is significantly above chance on MMLU and has already been trained on approximately 3 trillion tokens, we observe a clear signal on all standard evaluations, as well as greater sensitivity to changes in dataset composition.
An important question, given that Zyda-2 is derived from four pre-existing datasets, is what the optimal weightings of these datasets are. We conducted a series of experiments to determine this using the annealing scheme described above. We found that weighting each dataset simply by its total number of tokens is suboptimal, and that upweighting FineWeb-Edu to an equal proportion with DCLM performed best. Due to their smaller sizes, Zyda-1 and Dolma-CC together make up approximately 5% of the total dataset.
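To show what such a weighting looks like in practice, here is a hypothetical sketch that interleaves the four components with non-uniform sampling probabilities using the datasets library; the file paths and exact probabilities are placeholders that merely mirror the qualitative weighting described above (DCLM and FineWeb-Edu upweighted to equal shares, Zyda-1 plus Dolma-CC around 5%).

```python
# Sample the four components with non-uniform weights while streaming.
from datasets import load_dataset, interleave_datasets

names = ["dclm", "fineweb-edu", "zyda-1", "dolma-cc"]
probs = [0.475, 0.475, 0.025, 0.025]   # illustrative proportions only

components = [
    load_dataset("parquet",
                 data_files=f"zyda2/{name}/*.parquet",   # placeholder paths
                 split="train", streaming=True)
    for name in names
]
mixture = interleave_datasets(components, probabilities=probs, seed=42)
```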
For our quality filtering and deduplication steps we utilized NeMo Curator for dataset processing, and found that it performs extremely well for distributed multi-GPU workloads. NeMo Curator supports InfiniBand, which enables excellent scaling on multi-node setups.
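For reference, a hypothetical sketch of the kind of GPU Dask cluster such a pipeline runs on; the UCX/InfiniBand options shown are dask_cuda flags and represent one reasonable setup, not the exact configuration we used.

```python
# One Dask worker per local GPU, with UCX enabled for fast inter-node links.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster(
    protocol="ucx",           # UCX transport enables InfiniBand / NVLink paths
    enable_infiniband=True,   # assumption: requires UCX and IB-capable hardware
    rmm_pool_size="24GB",     # pre-allocate a GPU memory pool per worker
)
client = Client(cluster)
```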
We benchmarked NeMo Curator against our previous CPU-based Python package, Zyda-processing. We compared running both packages on a single a3-highgpu-8g instance on Google Cloud, which has 208 vCPUs, 1.9TB of RAM, and 8 H100 80GB GPUs, a typical configuration for an H100 node.
We found that running NeMo Curator on this node using GPUs is 10 times faster than running our Zyda package on CPUs. Processing the whole raw dataset (roughly 12 billion documents, or 11.5T tokens) takes only about 2 days on 8 GPU nodes, whereas the same processing on 8 CPU nodes would take roughly 3 weeks.
We find that performant CPU nodes with enough RAM and good networking capabilities typically cost 5 times less than a node of H100s. By leveraging NeMo Curator and accelerating the data processing pipelines on GPUs, our team reduced the total cost of ownership (TCO) by 2x and processed the data 10x faster (from 3 weeks to 2 days).
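A quick back-of-the-envelope check of that 2x figure under the stated assumptions (GPU nodes roughly 5x the cost per node-hour, roughly 10x the speed):

```python
# GPU nodes cost ~5x more per node-hour but finish ~10x faster,
# so the total processing cost is roughly halved.
cpu_node_cost = 1.0                 # arbitrary cost unit per node-day
gpu_node_cost = 5.0 * cpu_node_cost
cpu_time, gpu_time = 21.0, 2.0      # days (approximately 3 weeks vs 2 days)

cpu_total = cpu_node_cost * cpu_time
gpu_total = gpu_node_cost * gpu_time
print(cpu_total / gpu_total)        # ~2.1x cheaper overall on GPUs
```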
In conclusion, NeMo Curator is a powerful tool that allows us not only to scale our data processing capabilities but also to optimize costs.
Download the Zyda-2 dataset directly from Hugging Face and train higher-accuracy models. For more information see our GitHub tutorial.
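For example, a minimal way to stream a few samples (the repository name Zyphra/Zyda-2 is taken from the release; check the dataset card in case a specific subset/config name is required):

```python
# Stream a handful of Zyda-2 samples without downloading the full dataset.
from datasets import load_dataset

zyda2 = load_dataset("Zyphra/Zyda-2", split="train", streaming=True)
# You may need to pass a specific config, e.g. name="<subset>", per the dataset card.
for i, sample in enumerate(zyda2):
    print(sample["text"][:200])   # assumes a "text" column, as in the source datasets
    if i == 2:
        break
```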
Evaluating the quality of datasets is challenging without training large scale models upon them, which costs significant compute and is infeasible for every possible dataset ablation. One typical approach to get signal on such dataset ablations is to train small models on small slices of the dataset (for instance a 300 million parameter model on 100 billion tokens) and then compare the evaluation performance of such models. This is the approach taken by the vast majority of recent papers and datasets released. This approach has advantages in that it is computationally feasible to run many such experiments even on relatively constrained resources but for such small models many standard evaluations do not show strong signal (for instance MMLU remains at chance for such models) and often even when there is signal it is noisy.
An alternative approach is to utilize the newly popular annealing approach to training LLMs. It has been found that training an LLM on a large amount of data with a relatively slow learning-rate decay followed by a rapid learning rate decay over a high quality dataset can significantly improve performance. Many recent models have reported using such an annealing strategy although details are often sparse.
Blakeney et al proposes utilizing the annealing phase as a method to test the performance of different datasets. While not requiring significantly more compute than training a small model from scratch, annealing enables much larger and already trained base models to be used which can show signal on important evaluation metrics such as MMLU.
We validate the results proposed in this paper where we see significantly less signal on our dataset ablations using small models vs annealing. We perform our annealing ablations on our pre-annealing checkpoint of our Zamba2-2.7B model which we run for 40 billion tokens. Since our Zamba2-2.7B model is significantly above chance at MMLU and is already highly trained on approximately 3 trillion tokens we observe a clear signal on all standard ablations and additionally greater sensitivity to changes in dataset composition.
At Zyphra, one of our key priorities is producing the highest-quality and most efficient models for a given parameter budget, with a special focus on small, highly powerful models which can automate many tasks extremely cheaply and be utilized on consumer and edge devices. This has driven our innovations in model architecture and also necessitates a focus on constructing strong pretraining and annealing datasets.
Our general approach is to first collect strong open-source datasets (like Dolma, FineWeb, and DCLM), then remove duplicates and aggressively filter them to obtain the highest-quality subset that meets our training budget. Our first dataset was Zyda-1 used to train Zamba-7B, which was a dataset composed of 1.3T tokens outperforming all open major language modeling datasets at the time, such as Dolma-1.6, FineWeb, and RefinedWeb. For this dataset, we constructed our own data processing pipeline which is open-sourced in the Zyda-processing library which entirely relies on CPU compute for both deduplication and filtering steps.
For our next round of datasets we aimed to significantly increase the scale of the dataset beyond Zyda-1 as well as the quality, to enable us to reach the frontier of small-LLM performance, which we have reached with our Zamba2 series of models. Because of this, we realized that our previous hand-made CPU-based data processing pipeline would not be sufficient and turned to NVIDIA NeMo Curator to speed up development time, lower costs and decrease runtime of our dataset processing significantly.
NeMo Curator is a GPU-accelerated data curation library that improves generative AI model performance by processing large-scale, high-quality datasets for pretraining and customization.
By providing ready-made scripts for filtering and deduplication, as well as its own suite of topic models, NeMo Curator enables rapid prototyping and development of dataset processing workflows.
Additionally, due to its deep integration with Dask and optimization for GPU processing, NeMo Curator enables much more straightforward parallelization of jobs and the ability to perform certain data processing stages on GPUs leading to substantial speedup – from 3 weeks on CPU to 2 days on GPU.
Zyda-2 is an approximately 5-trillion token permissively licensed dataset which is available on HuggingFace. Models trained on this dataset perform extremely strongly on downstream standardized evaluation tasks, beating current state-of-the-art datasets such as DCLM and FineWeb-Edu. Our leading Zamba2 series of models was trained on an early version of this dataset.
Zyda-2 comprises the majority of open-source high-quality tokens available online. Specifically, it consists of DCLM, FineWeb-Edu, Zyda-1, and the Common-Crawl portion of Dolma v1.7.
DCLM and FineWeb-Edu have already undergone significant quality filtering as a core component of their creation and so perform extremely strongly on many benchmark tasks. Zyda-1 and Dolma-CC are less filtered datasets and contain a higher variety of internet documents which are not all designed to be educational. While these datasets lead to worse performance on small model ablations, we believe that for larger models having greater dataset diversity is extremely important and indeed combining the Zyda-1 and Dolma-CC portions with FineWeb-Edu and DCLM leads to increased downstream evaluation performance.
We perform cross-deduplication between all datasets. We believe this increases quality per token since it removes duplicated documents from the dataset. Following on from that, we perform model-based quality filtering on Zyda-1 and Dolma-CC using NeMo Curator’s quality-classifier-deberta, keeping only the ‘high-quality’ subset of these datasets.
The combination of all these datasets improves overall performance so that Zyda-2 outperforms each of its individual component datasets. We believe this is both due to our cross-deduplication efforts as well as an ensembling effect where combining a variety of data sources obtained through different processing pipelines leads to more diverse data which improves model performance and robustness.
To create Zyda-2 we first obtained our starting data sources, DCLM, FineWeb-Edu-score-2 (FineWeb-Edu2 for short), Zyda-1 and Common Crawl portion of Dolma v1.7 (Dolma-CC for short), and then put them through a two-stage pipeline (see Fig. 3). We chose the FineWeb-Edu2 version of FineWeb-Edu to have a much bigger starting pool of documents. The first stage was a cross-deduplication stage applied to all datasets followed by a model-based filtering stage applied to Zyda-1 and Dolma-CC.
As the final step we decided to filter FineWeb-Edu2 by its educational score, essentially converting it to FineWeb-Edu, which is the recommended version of FineWeb-Edu dataset. We elaborate on the reasons below.
This whole process results in the best possible version of each subset, comprising the final dataset of 5T tokens.
Our first phase was to perform deduplication across our component dataset since, given that they all ultimately originated from likely similar Common Crawl scrapes, they may contain significant fractions of duplicated documents. Deduplication has been found to generally improve language modeling performance although recent papers have claimed its effects are neutral and maybe negative, which we discuss later.
As expected we found a significant number of duplicated documents across datasets, which resulted in us removing approximately 32% of the total tokens from Zyda-2 compared to its source dataset. We used the approximate minhash LSH deduplication that is provided in NeMo Curator for our deduplication pipelinewith the following parameters: minhash with signature size of 128 computed on character-based 25-grams signatures and split into 8 bands, giving roughly 85% Jaccard similarity threshold. Then we constructed an undirected graph with nodes being documents and edges being duplicates, and found connected components in it, which basically provided us with clusters of duplicates. Duplicates in FineWeb-Edu2 and DCLM.
Duplicates in FineWeb-Edu2 and DCLM

While performing our deduplication analyses, we also performed a full intra-dataset deduplication of both DCLM and FineWeb-Edu. We found that both datasets contain a very large number of internal duplicates. For FineWeb-Edu this had previously been noticed, but the same is true of DCLM, where we believe roughly 80% of documents are duplicates. Both the DCLM and FineWeb-Edu papers report that internal deduplication did not improve their results, and so they did not pursue it or release deduplicated versions of their datasets.
These results raise several interesting questions. Firstly, it is unclear why removing duplicates does not improve performance. It implies that documents with large numbers of identical tokens provide approximately the same benefit as fresh tokens from the same distribution. We can perhaps consider training on a highly duplicated dataset like FineWeb-Edu as equivalent to performing a 2-5-epoch shuffled run over a much smaller deduplicated dataset, which would imply that a small number of epochs does not particularly harm language model performance at these scales. However, this effect may not hold at larger scales, where language models can memorize the training dataset much more sample-efficiently and thus require more regularization to counter overfitting on multi-epoched text data.
If this is true, however, it is unclear why earlier results showed positive effects of deduplication. One hypothesis is that the early datasets on which deduplication was found to help were not strongly filtered by model-based classifiers as DCLM and FineWeb-Edu are, so the deduplication step may simply have been removing many low-quality spam documents that occur frequently in unfiltered web data; in other words, deduplication may have been acting as an implicit quality filter. In more stringently filtered datasets, the magnitude of this effect diminishes and disappears into the evaluation noise. We do note that the most frequent duplicate in DCLM is indeed a spam document of low quality. Additionally, we notice that on supposedly 'duplicate' data, the scores assigned by the quality model often vary significantly despite only very minor differences in text, which is perhaps indicative of a relative lack of robustness in some of these classifiers, or of a significant fraction of false positives in the deduplication step.
To be cautious, we decided not to change the internal deduplication of DCLM and FineWeb-Edu at this point, and limited ourselves to cross-deduplication between the datasets. That is, we only considered clusters containing documents from different sources. For every such cluster, we ranked documents by source in the order FineWeb-Edu2 -> DCLM -> Zyda-1 -> Dolma-CC, preserved the documents from the highest-ranking source, and removed the rest.
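As an illustration of this selection rule, the sketch below applies the source ranking to a hypothetical duplicate cluster and keeps only the documents from the highest-priority source present; the cluster structure and document IDs are assumptions for the example.

```python
# Source priority used for cross-deduplication (lower index = higher priority).
PRIORITY = ["fineweb-edu2", "dclm", "zyda-1", "dolma-cc"]
RANK = {source: i for i, source in enumerate(PRIORITY)}

def resolve_cluster(cluster):
    """Keep only documents from the highest-ranking source in a duplicate cluster.

    `cluster` is a list of (doc_id, source) pairs; returns (kept, removed).
    """
    best = min(RANK[source] for _, source in cluster)
    kept = [doc for doc in cluster if RANK[doc[1]] == best]
    removed = [doc for doc in cluster if RANK[doc[1]] != best]
    return kept, removed

# Hypothetical cluster spanning three sources.
cluster = [("doc-17", "dclm"), ("doc-42", "fineweb-edu2"), ("doc-99", "dolma-cc")]
kept, removed = resolve_cluster(cluster)
print(kept)     # [('doc-42', 'fineweb-edu2')]
print(removed)  # [('doc-17', 'dclm'), ('doc-99', 'dolma-cc')]
```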
Since Zyda-1 and Dolma-CC, unlike FineWeb-Edu and DCLM, had not been processed by model-based filtering, we decided to experiment with the quality-classifier-deberta model provided in NeMo Curator. We applied this classifier to Zyda-1 and Dolma-CC and experimented with either removing only the 'low'-quality documents or keeping only the 'high'-quality ones. We found that keeping only the high-quality documents (10-20% of the total) significantly improved model performance.
This is likely because, since the original datasets are not filtered by model-based classifiers (although Zyda-1 has significant syntactic filtering), both datasets contain large numbers of web documents which do not particularly help with downstream language modeling tasks.
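The sketch below shows the shape of this classifier-based filtering step. It deliberately abstracts the model behind a hypothetical classify_quality function rather than reproducing the NeMo Curator API, and keeps only documents labeled 'High'.

```python
from typing import Callable, Iterable, Iterator

Document = dict  # e.g. {"id": ..., "text": ...}

def keep_high_quality(
    docs: Iterable[Document],
    classify_quality: Callable[[str], str],
) -> Iterator[Document]:
    """Yield only documents the classifier labels as 'High'.

    classify_quality is a stand-in for a model such as quality-classifier-deberta
    that returns a label like 'High', 'Medium', or 'Low'.
    """
    for doc in docs:
        if classify_quality(doc["text"]) == "High":
            yield doc

# Toy stand-in classifier for the example: treat longer documents as higher quality.
def toy_classifier(text: str) -> str:
    return "High" if len(text.split()) > 50 else "Low"

docs = [
    {"id": 1, "text": "short spam snippet"},
    {"id": 2, "text": "a much longer, substantive article " * 20},
]
print([d["id"] for d in keep_high_quality(docs, toy_classifier)])  # [2]
```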
As an additional step, we filtered FineWeb-Edu2 by its educational score, essentially converting it to FineWeb-Edu, the higher-quality subset of FineWeb-Edu2 comprising samples with an educational score of 3 or higher according to the FineWeb quality classifier. This threshold yielded the best performance in the ablations conducted during the creation of FineWeb-Edu. Notably, we applied this filtering after cross-deduplicating DCLM against FineWeb-Edu2, which eliminated all samples in DCLM deemed duplicates of FineWeb-Edu2 documents. This sequence of steps indirectly applied the FineWeb quality classifier to DCLM, effectively removing the DCLM samples with lower educational content.
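As a sketch of this conversion step, assuming the educational score is stored per sample (for example, as an int_score column, as in the public FineWeb-Edu releases), the filter reduces to a simple threshold:

```python
import pandas as pd

# Hypothetical slice of FineWeb-Edu2 metadata; in the real dataset the classifier
# score is stored alongside each document (the column name here is an assumption).
df = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "int_score": [2, 3, 4, 1],
})

# Keeping samples with an educational score of 3 or higher effectively converts
# FineWeb-Edu2 into FineWeb-Edu.
fineweb_edu = df[df["int_score"] >= 3]
print(fineweb_edu["id"].tolist())  # ['b', 'c']
```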
Evaluating the quality of datasets is challenging without training large-scale models on them, which costs significant compute and is infeasible for every possible dataset ablation. One typical approach to get signal on such ablations is to train small models on small slices of the dataset (for instance, a 300-million-parameter model on 100 billion tokens) and then compare the evaluation performance of these models. This is the approach taken by the vast majority of recently released papers and datasets. It has the advantage that many such experiments are computationally feasible even with relatively constrained resources, but many standard evaluations show little signal for such small models (for instance, MMLU remains at chance), and even where there is signal it is often noisy.
An alternative approach is to utilize the increasingly popular annealing phase of LLM training. It has been found that training an LLM on a large amount of data with a relatively slow learning-rate decay, followed by a rapid learning-rate decay over a high-quality dataset, can significantly improve performance. Many recent models have reported using such an annealing strategy, although details are often sparse.
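To make this concrete, here is a minimal sketch of such a two-phase learning-rate schedule: a slow cosine decay over the main run followed by a rapid linear decay to zero over the annealing tokens. The shape and constants are illustrative assumptions, not the schedule used for our models.

```python
import math

def learning_rate(step: int, total_steps: int, anneal_steps: int,
                  peak_lr: float = 3e-4, floor_lr: float = 3e-5) -> float:
    """Two-phase schedule: slow cosine decay, then a rapid linear decay to zero."""
    main_steps = total_steps - anneal_steps
    if step < main_steps:
        # Main pretraining phase: slow cosine decay from peak_lr down to floor_lr.
        progress = step / main_steps
        return floor_lr + 0.5 * (peak_lr - floor_lr) * (1 + math.cos(math.pi * progress))
    # Annealing phase over the high-quality dataset: rapid linear decay to zero.
    anneal_progress = (step - main_steps) / anneal_steps
    return floor_lr * (1 - anneal_progress)

# Example: 100k total steps with the final 5k annealed on the high-quality mix.
for step in (0, 50_000, 95_000, 99_999):
    print(step, f"{learning_rate(step, 100_000, 5_000):.2e}")
```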
Blakeney et al. propose utilizing the annealing phase as a method for testing the performance of different datasets. While not requiring significantly more compute than training a small model from scratch, annealing enables the use of much larger, already-trained base models, which can show signal on important evaluation metrics such as MMLU.
We validate the results of this paper: we see significantly less signal in our dataset ablations when using small models than when annealing. We perform our annealing ablations from the pre-annealing checkpoint of our Zamba2-2.7B model, which we run for 40 billion tokens. Since Zamba2-2.7B is significantly above chance on MMLU and has already been trained on approximately 3 trillion tokens, we observe a clear signal on all standard evaluations as well as greater sensitivity to changes in dataset composition.
An important question, given that Zyda-2 is derived from four pre-existing datasets, is what the optimal weighting of these datasets is. We conducted a series of experiments to determine this using the annealing scheme described previously. We found that weighting each dataset by its total token count is suboptimal, and that upweighting FineWeb-Edu to the same proportion as DCLM performed best. Due to their smaller sizes, Zyda-1 and Dolma-CC together make up approximately 5% of the total dataset.
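The sketch below shows how such sampling weights could be assembled from target proportions. The token counts are placeholders, and the exact target fractions, beyond FineWeb-Edu matching DCLM and Zyda-1 plus Dolma-CC totaling roughly 5%, are assumptions.

```python
def sampling_weights(token_counts: dict, target_fractions: dict) -> dict:
    """Per-source sampling weight needed to hit the target token fractions.

    A weight above 1 means the source is upweighted relative to sampling in
    proportion to its raw token count.
    """
    total = sum(token_counts.values())
    return {
        name: target_fractions[name] / (token_counts[name] / total)
        for name in token_counts
    }

# Placeholder token counts (in trillions); these are not the actual Zyda-2 numbers.
token_counts = {"dclm": 3.0, "fineweb-edu": 1.3, "zyda-1": 0.16, "dolma-cc": 0.1}

# Target mix: DCLM and FineWeb-Edu weighted equally, Zyda-1 + Dolma-CC ~5% combined.
target_fractions = {"dclm": 0.475, "fineweb-edu": 0.475, "zyda-1": 0.03, "dolma-cc": 0.02}

print(sampling_weights(token_counts, target_fractions))
```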
For our quality filtering and deduplication steps, we used NeMo Curator for dataset processing and found that it performs extremely well for distributed multi-GPU processing. NeMo Curator supports InfiniBand, which enables excellent scaling on multi-node setups.
We benchmarked NeMo Curator against our previous CPU-based Python package, Zyda-Processing. We compared running both packages on a single a3-highgpu-8g instance on Google Cloud, which has 208 vCPUs, 1.9 TB of RAM, and eight 80 GB H100 GPUs, a typical configuration for an H100 node.
We found that running NeMo Curator on this node using GPUs is 10 times faster than running our Zyda package. Processing the whole raw dataset (roughly 12 billion documents with 11.5T tokens) would take only about 2 days on 8 GPU nodes, whereas processing on 8 CPU nodes would take roughly 3 weeks.
We find that performant CPU nodes with sufficient RAM and good networking typically cost about one fifth as much as a node of H100s. By leveraging NeMo Curator and accelerating the data processing pipelines on GPUs, our team reduced the total cost of ownership (TCO) by 2x and processed the data 10x faster (from 3 weeks to 2 days).
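The 2x TCO figure follows directly from these two ratios, as the quick calculation below shows (relative units only):

```python
# Relative processing cost, taking one CPU node for one day as 1 cost unit.
# The numbers are the ratios quoted above, not absolute prices.
cpu_node_cost, gpu_node_cost = 1.0, 5.0   # a GPU node costs roughly 5x a CPU node
cpu_days, gpu_days = 21.0, 2.1            # ~3 weeks on CPU vs ~10x faster on GPU

cpu_total = cpu_node_cost * cpu_days      # 21.0
gpu_total = gpu_node_cost * gpu_days      # 10.5
print(f"TCO reduction: {cpu_total / gpu_total:.1f}x")  # 2.0x
```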
In conclusion, NeMo Curator is a powerful tool that allows us not only to scale our data processing capabilities but also to optimize costs.
Download the Zyda-2 dataset directly from Hugging Face and train higher-accuracy models. For more information see our GitHub tutorial.
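For reference, here is a minimal sketch of streaming the dataset with the Hugging Face datasets library. It assumes the Zyphra/Zyda-2 repository can be streamed with its default configuration; the available configurations and column names should be checked on the dataset page.

```python
from datasets import load_dataset

# Stream Zyda-2 instead of downloading all ~5T tokens up front.
# Repository, configuration, and column names should be verified on the dataset page.
ds = load_dataset("Zyphra/Zyda-2", split="train", streaming=True)

for i, sample in enumerate(ds):
    print(sample.get("text", "")[:200])  # first 200 characters of each document
    if i >= 2:
        break
```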
We present histograms depicting the distribution of duplicate-cluster sizes in all the datasets (see Fig. 7-11). Note that all the figures are on a log-log scale. We see a significant drop in the number of clusters starting at sizes of around 100. This drop is present in both DCLM and FineWeb-Edu2 (see Fig. 8 and 9, respectively), and is most likely explained by a combination of the deduplication strategy and quality filtering used when creating both datasets: DCLM deduplication was done individually within 10 shards, while FineWeb-Edu2 was deduplicated within every Common Crawl snapshot. We find that large clusters usually contain low-quality material (repeated advertisements, license agreement templates, etc.), so it is not surprising that such documents were removed. Notably, DCLM still contained one cluster of close to 1 million documents, consisting of low-quality documents seemingly coming from advertisements (see the example below).

We find that both Zyda-1 and Dolma-CC contain a small number of duplicates, which is expected, since both datasets were deduplicated globally by their authors. The remaining duplicates are likely false negatives from the initial deduplication procedure. Note that the distributions of duplicate-cluster sizes for these two datasets (Fig. 10 and 11) do not contain any sharp drops, but rather decrease hyper-exponentially with cluster size.
Below is an example of a document from the largest cluster of duplicates in DCLM (~1M documents; quality score 0.482627):
Is safe? Is scam?
Is safe for your PC?
Is safe or is it scam?
Domain is SafeSafe score: 1
The higher the number, the more dangerous the website.Any number higher than 1 means DANGER.
Positive votes:
Negative votes:
Vote Up Vote Down review
Have you had bad experience with Warn us, please!
Below are a few documents from DCLM with different quality scores, all coming from the same duplicate cluster. The quality scores vary from ~0.2 down to ~0.04.