Zyphra is pleased to announce Zyda, a 1.3 trillion-token open dataset for language modeling. Zyda combines existing high-quality open datasets and merges them through a uniform and thorough filtering and deduplication process. The goal of Zyda is to provide a simple, accessible, and highly performant dataset for language modeling experiments and training runs up to the 1 trillion token scale. In our ablation studies, Zyda outperforms all existing open datasets, including Dolma, FineWeb, the Pile, RefinedWeb, and SlimPajama.
Zyda was created by merging seven well-respected open language modeling datasets – RefinedWeb, StarCoder, C4, the Pile, SlimPajama, peS2o, and arXiv – and then applying a uniform and meticulous post-processing pipeline to the combined data. We performed thorough syntactic filtering to remove many clearly low-quality documents, followed by an aggressive deduplication pass, both within and between the datasets. Cross-deduplication is especially important, as we found that many datasets contained a large number of documents also present in other datasets – likely because most of these datasets are drawn from common sources such as Common Crawl. In total, we discarded approximately 40% of our initial dataset, reducing its token count from approximately 2T tokens to 1.3T.
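To make the cross-deduplication step concrete, below is a minimal sketch of fuzzy deduplication across datasets using MinHash LSH via the datasketch library. The shingle size, similarity threshold, and toy dataset layout are illustrative assumptions, not the exact parameters of the Zyda pipeline.

```python
# Sketch of cross-dataset fuzzy deduplication with MinHash LSH.
# The shingle size, threshold, and dataset layout are assumptions for
# illustration, not the exact Zyda configuration.
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from whitespace-tokenized 5-gram shingles."""
    m = MinHash(num_perm=num_perm)
    tokens = text.split()
    for i in range(max(len(tokens) - 4, 1)):
        shingle = " ".join(tokens[i:i + 5])
        m.update(shingle.encode("utf-8"))
    return m

def cross_deduplicate(datasets: dict[str, list[str]], threshold: float = 0.8):
    """Keep a document only if no near-duplicate was already seen in this
    or any previously processed dataset."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = {}
    for name, docs in datasets.items():
        kept[name] = []
        for idx, doc in enumerate(docs):
            sig = minhash(doc)
            if lsh.query(sig):  # a near-duplicate already exists somewhere
                continue
            lsh.insert(f"{name}:{idx}", sig)
            kept[name].append(doc)
    return kept

if __name__ == "__main__":
    # Toy example: the C4 copy of the repeated document is dropped.
    toy = {
        "refinedweb": ["the quick brown fox jumps over the lazy dog " * 3],
        "c4": ["the quick brown fox jumps over the lazy dog " * 3,
               "a completely different document about language modeling"],
    }
    deduped = cross_deduplicate(toy)
    print({k: len(v) for k, v in deduped.items()})
```

In practice this kind of pipeline would run over sharded data with a distributed LSH index; the single in-memory index here is only meant to show the within- and between-dataset logic.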
Scores of Zamba (trained on Zyda) vs. models trained on competing datasets. We observe that on a per-token basis Zamba is significantly stronger than competing models, testifying to the strength of Zyda as a pretraining dataset.
We created Zyda as part of our own internal dataset efforts, and an early version of Zyda was used to train Zamba, which performs strongly on a per-training-token basis.
We present histograms depicting the distribution of duplicate-cluster sizes in all the datasets (see Fig. 7-11). Note that all figures are in log-log scale. We see a significant drop in the number of clusters starting at a size of around 100. This drop is present in both DCLM and FineWeb-Edu2 (see Fig. 8 and 9, respectively) and is most likely explained by a combination of the deduplication strategy and quality filtering used when creating both datasets: DCLM was deduplicated individually within 10 shards, while FineWeb-Edu2 was deduplicated within every Common Crawl snapshot. We find that large clusters usually contain low-quality material (repeated advertisements, license agreement templates, etc.), so it is not surprising that such documents were removed. Notably, DCLM still contained one cluster of close to 1 million documents, consisting of low-quality documents seemingly coming from advertisements (see Appendix).

We find that both Zyda-1 and Dolma-CC contain a small number of duplicates, which is expected, since both datasets were deduplicated globally by their authors. The remaining duplicates are likely false negatives from the initial deduplication procedure. Note that the distributions of duplicate-cluster sizes for these two datasets (Fig. 10 and 11) do not contain any sharp drops, but rather decrease hyper-exponentially with cluster size.
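For reference, here is a minimal sketch of how such a log-log histogram of duplicate-cluster sizes might be produced. The input format (a flat list of cluster sizes) and the synthetic Pareto-distributed data are assumptions for illustration only, not our actual analysis code.

```python
# Sketch: plot the number of duplicate clusters at each cluster size
# on log-log axes, as in the figures described above.
from collections import Counter
import random

import matplotlib.pyplot as plt

def plot_cluster_size_histogram(cluster_sizes: list[int], label: str) -> None:
    """Plot the count of clusters per cluster size on log-log axes."""
    counts = Counter(cluster_sizes)  # cluster size -> number of clusters
    sizes = sorted(counts)
    plt.loglog(sizes, [counts[s] for s in sizes],
               marker="o", linestyle="none", label=label)
    plt.xlabel("Cluster size (documents)")
    plt.ylabel("Number of clusters")
    plt.legend()

if __name__ == "__main__":
    # Synthetic, roughly power-law distributed cluster sizes as a stand-in
    # for the real per-dataset cluster statistics.
    synthetic = [int(random.paretovariate(1.5)) + 1 for _ in range(10_000)]
    plot_cluster_size_histogram(synthetic, label="synthetic example")
    plt.show()
```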
Below is an example of a document from the largest cluster of duplicates in DCLM (~1M documents; quality score 0.482627):
Is safe? Is scam?
Is safe for your PC?
Is safe or is it scam?
Domain is SafeSafe score: 1
The higher the number, the more dangerous the website.Any number higher than 1 means DANGER.
Positive votes:
Negative votes:
Vote Up Vote Down review
Have you had bad experience with Warn us, please!
Below are a few documents from DCLM with different quality scores, all coming from the same duplicate cluster. The quality scores vary from ~0.2 to ~0.04.