June 7, 2024
PALO ALTO, CALIFORNIA

Zyphra is pleased to announce Zyda, a 1.3-trillion-token open dataset for language modeling. Zyda combines existing high-quality open datasets and merges them through a uniform and thorough filtering and deduplication process. The goal of Zyda is to provide a simple, accessible, and highly performant dataset for language-modeling experiments and for training at up to the 1-trillion-token scale. In our ablation studies, Zyda outperforms all existing open datasets, including Dolma, FineWeb, the Pile, RefinedWeb, and SlimPajama.
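Zyda's full pipeline is not reproduced here, but the core idea of deduplicating documents when merging several sources can be sketched as follows. This is a minimal illustration using exact matching on normalized text hashes; the function names and normalization choices are hypothetical, not Zyda's actual implementation.

```python
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different
    # copies of the same document hash to the same value.
    return " ".join(text.lower().split())

def deduplicate(documents: list[str]) -> list[str]:
    # Keep the first occurrence of each normalized document,
    # dropping exact duplicates across the merged sources.
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = [
    "The quick brown fox.",
    "the  quick brown fox.",   # duplicate after normalization
    "A different document.",
]
print(deduplicate(docs))
```

Real dataset pipelines typically extend this with fuzzy (near-duplicate) detection, e.g. MinHash over n-grams, since web-scraped copies of a document rarely match byte-for-byte.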

Authors
Yury Tokpanov, Beren Millidge, Paolo Glorioso, Jonathan Pilault, James Whittington, Quentin Anthony
Collaborators
Daniel A Roberts (Sequoia Capital & MIT), Andrey Gromov (Meta FAIR), Kushal Tirumala (Meta FAIR) and Hassan Shapourian (Cisco)