Zyphra is excited to release Zamba2-7B, a state-of-the-art small language model. At the 7B scale, we outperform the leading models of Mistral, Google’s Gemma and Meta’s Llama3 series in both quality and performance. We believe Zamba2-7B is the leading model for running on-device and on consumer GPUs as well as for many enterprise applications which require a powerful but compact and efficient model for natural-language tasks.
Zamba2 performs exceptionally well on standard language modeling evaluation sets, especially given its latency and generation speed. Among small language models (≤8B), we lead the pack in both quality and performance.
Our model outperforms existing state-of-the-art models for the following reasons:
Due to the exceptional quality of our pretraining and annealing datasets, Zamba2-7B performs extremely well on a per-training-token basis, sitting comfortably above the curve traced out by competitor models.
Zamba2 performs exceptionally well on standard language modeling evaluation sets, especially given its latency and generation speed. Among small language models (≤8B), we lead the pack in both quality and performance.
Our model outperforms existing state-of-the-art models for the following reasons:
Due to the exceptional quality of our pretraining and annealing datasets, Zamba2-7B performs extremely well on a per-training-token basis, sitting comfortably above the curve traced out by competitor models.
Zamba2-7B utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba layers interleaved with one or more shared attention layers (one shared attention in Zamba1, two in Zamba2). This attention has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings of the input to this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared MLP to gain some additional expressivity in each block and allow each shared block to specialize slightly to its own unique position while keeping the additional parameter overhead small.
Zamba2-7B utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba layers interleaved with one or more shared attention layers (one shared attention in Zamba1, two in Zamba2). This attention has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings of the input to this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared MLP to gain some additional expressivity in each block and allow each shared block to specialize slightly to its own unique position while keeping the additional parameter overhead small.
Zamba2-7B utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba layers interleaved with one or more shared attention layers (one shared attention in Zamba1, two in Zamba2). This attention has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings of the input to this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared MLP to gain some additional expressivity in each block and allow each shared block to specialize slightly to its own unique position while keeping the additional parameter overhead small.
Zamba2 performs exceptionally well on standard language modeling evaluation sets, especially given its latency and generation speed. Among small language models (≤8B), we lead the pack in both quality and performance.
Zamba2-7B utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba layers interleaved with one or more shared attention layers (one shared attention in Zamba1, two in Zamba2). This attention has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings of the input to this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared MLP to gain some additional expressivity in each block and allow each shared block to specialize slightly to its own unique position while keeping the additional parameter overhead small.
Zamba2-7B utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba layers interleaved with one or more shared attention layers (one shared attention in Zamba1, two in Zamba2). This attention has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings of the input to this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared MLP to gain some additional expressivity in each block and allow each shared block to specialize slightly to its own unique position while keeping the additional parameter overhead small.
Zamba2-7B utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba layers interleaved with one or more shared attention layers (one shared attention in Zamba1, two in Zamba2). This attention has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings of the input to this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared MLP to gain some additional expressivity in each block and allow each shared block to specialize slightly to its own unique position while keeping the additional parameter overhead small.
We achieve state-of-the-art inference efficiency, including latency, throughput and memory usage because:
Zamba2 performs exceptionally well on standard language modeling evaluation sets, especially given its latency and generation speed. Among small language models (≤8B), we lead the pack in both quality and performance.
Our model outperforms existing state-of-the-art models for the following reasons:
Due to the exceptional quality of our pretraining and annealing datasets, Zamba2-7B performs extremely well on a per-training-token basis, sitting comfortably above the curve traced out by competitor models.
Zamba2-7B utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba layers interleaved with one or more shared attention layers (one shared attention in Zamba1, two in Zamba2). This attention has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings of the input to this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared MLP to gain some additional expressivity in each block and allow each shared block to specialize slightly to its own unique position while keeping the additional parameter overhead small.
We achieve state-of-the-art inference efficiency, including latency, throughput and memory usage because:
Zamba2-7B was trained on 128 H100 GPUS for approximately 50 days using our internal training framework developed atop Megatron-LM. Zamba2-7B thus demonstrates that at the 7B scale the frontier is still reachable and surpassable with a small team and moderate budget.
Zamba2-7B is released under an open source license, allowing researchers, developers, and companies to leverage its capabilities. We invite the broader AI community to explore Zamba's unique architecture and continue pushing the boundaries of efficient foundation models.
Instruct Zamba2-7B: https://huggingface.co/Zyphra/Zamba2-7B-Instruct
Base Zamba2-7B: https://huggingface.co/Zyphra/Zamba2-7B
Pure PyTorch: https://github.com/Zyphra/Zamba2
Zamba2 performs exceptionally well on standard language modeling evaluation sets, especially given its latency and generation speed. Among small language models (≤8B), we lead the pack in both quality and performance.
Our model outperforms existing state-of-the-art models for the following reasons:
Due to the exceptional quality of our pretraining and annealing datasets, Zamba2-7B performs extremely well on a per-training-token basis, sitting comfortably above the curve traced out by competitor models.
Zamba2-7B utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba layers interleaved with one or more shared attention layers (one shared attention in Zamba1, two in Zamba2). This attention has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings of the input to this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared MLP to gain some additional expressivity in each block and allow each shared block to specialize slightly to its own unique position while keeping the additional parameter overhead small.
We achieve state-of-the-art inference efficiency, including latency, throughput and memory usage because:
Zamba2-7B was trained on 128 H100 GPUS for approximately 50 days using our internal training framework developed atop Megatron-LM. Zamba2-7B thus demonstrates that at the 7B scale the frontier is still reachable and surpassable with a small team and moderate budget.
Zamba2-7B is released under an open source license, allowing researchers, developers, and companies to leverage its capabilities. We invite the broader AI community to explore Zamba's unique architecture and continue pushing the boundaries of efficient foundation models.
Instruct Zamba2-7B: https://huggingface.co/Zyphra/Zamba2-7B-Instruct
Base Zamba2-7B: https://huggingface.co/Zyphra/Zamba2-7B
Pure PyTorch: https://github.com/Zyphra/Zamba2
Zamba2 performs exceptionally well on standard language modeling evaluation sets, especially given its latency and generation speed. Among small language models (≤8B), we lead the pack in both quality and performance.
Our model outperforms existing state-of-the-art models for the following reasons:
Due to the exceptional quality of our pretraining and annealing datasets, Zamba2-7B performs extremely well on a per-training-token basis, sitting comfortably above the curve traced out by competitor models.
Zamba2-7B utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba layers interleaved with one or more shared attention layers (one shared attention in Zamba1, two in Zamba2). This attention has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings of the input to this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared MLP to gain some additional expressivity in each block and allow each shared block to specialize slightly to its own unique position while keeping the additional parameter overhead small.
We achieve state-of-the-art inference efficiency, including latency, throughput and memory usage because:
Zamba2-7B was trained on 128 H100 GPUS for approximately 50 days using our internal training framework developed atop Megatron-LM. Zamba2-7B thus demonstrates that at the 7B scale the frontier is still reachable and surpassable with a small team and moderate budget.
Zamba2 performs exceptionally well on standard language modeling evaluation sets, especially given its latency and generation speed. Among small language models (≤8B), we lead the pack in both quality and performance.
Our model outperforms existing state-of-the-art models for the following reasons:
Due to the exceptional quality of our pretraining and annealing datasets, Zamba2-7B performs extremely well on a per-training-token basis, sitting comfortably above the curve traced out by competitor models.
Zamba2-7B utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba layers interleaved with one or more shared attention layers (one shared attention in Zamba1, two in Zamba2). This attention has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings of the input to this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared MLP to gain some additional expressivity in each block and allow each shared block to specialize slightly to its own unique position while keeping the additional parameter overhead small.
We achieve state-of-the-art inference efficiency, including latency, throughput and memory usage because:
Zamba2-7B was trained on 128 H100 GPUS for approximately 50 days using our internal training framework developed atop Megatron-LM. Zamba2-7B thus demonstrates that at the 7B scale the frontier is still reachable and surpassable with a small team and moderate budget.
Zamba2-7B is released under an open source license, allowing researchers, developers, and companies to leverage its capabilities. We invite the broader AI community to explore Zamba's unique architecture and continue pushing the boundaries of efficient foundation models.
Instruct Zamba2-7B: https://huggingface.co/Zyphra/Zamba2-7B-Instruct
Base Zamba2-7B: https://huggingface.co/Zyphra/Zamba2-7B
Pure PyTorch: https://github.com/Zyphra/Zamba2
Zyphra's team is committed to democratizing advanced AI systems, exploring novel architectures on the frontier of performance, and advancing the scientific study and understanding of powerful models. We look forward to collaborating with others who share our vision.
Zamba2 performs exceptionally well on standard language modeling evaluation sets, especially given its latency and generation speed. Among small language models (≤8B), we lead the pack in both quality and performance.
Our model outperforms existing state-of-the-art models for the following reasons:
Due to the exceptional quality of our pretraining and annealing datasets, Zamba2-7B performs extremely well on a per-training-token basis, sitting comfortably above the curve traced out by competitor models.
Zamba2-7B utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba layers interleaved with one or more shared attention layers (one shared attention in Zamba1, two in Zamba2). This attention has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings of the input to this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared MLP to gain some additional expressivity in each block and allow each shared block to specialize slightly to its own unique position while keeping the additional parameter overhead small.
We achieve state-of-the-art inference efficiency, including latency, throughput and memory usage because:
Zamba2-7B was trained on 128 H100 GPUS for approximately 50 days using our internal training framework developed atop Megatron-LM. Zamba2-7B thus demonstrates that at the 7B scale the frontier is still reachable and surpassable with a small team and moderate budget.
Zamba2-7B is released under an open source license, allowing researchers, developers, and companies to leverage its capabilities. We invite the broader AI community to explore Zamba's unique architecture and continue pushing the boundaries of efficient foundation models.
Instruct Zamba2-7B: https://huggingface.co/Zyphra/Zamba2-7B-Instruct
Base Zamba2-7B: https://huggingface.co/Zyphra/Zamba2-7B
Pure PyTorch: https://github.com/Zyphra/Zamba2
Zyphra's team is committed to democratizing advanced AI systems, exploring novel architectures on the frontier of performance, and advancing the scientific study and understanding of powerful models. We look forward to collaborating with others who share our vision.
Zamba2 performs exceptionally well on standard language modeling evaluation sets, especially given its latency and generation speed. Among small language models (≤8B), we lead the pack in both quality and performance.
Our model outperforms existing state-of-the-art models for the following reasons:
Due to the exceptional quality of our pretraining and annealing datasets, Zamba2-7B performs extremely well on a per-training-token basis, sitting comfortably above the curve traced out by competitor models.
Zamba2-7B utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba layers interleaved with one or more shared attention layers (one shared attention in Zamba1, two in Zamba2). This attention has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings of the input to this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared MLP to gain some additional expressivity in each block and allow each shared block to specialize slightly to its own unique position while keeping the additional parameter overhead small.
We achieve state-of-the-art inference efficiency, including latency, throughput and memory usage because:
Zamba2-7B was trained on 128 H100 GPUS for approximately 50 days using our internal training framework developed atop Megatron-LM. Zamba2-7B thus demonstrates that at the 7B scale the frontier is still reachable and surpassable with a small team and moderate budget.
Zamba2-7B is released under an open source license, allowing researchers, developers, and companies to leverage its capabilities. We invite the broader AI community to explore Zamba's unique architecture and continue pushing the boundaries of efficient foundation models.
Instruct Zamba2-7B: https://huggingface.co/Zyphra/Zamba2-7B-Instruct
Base Zamba2-7B: https://huggingface.co/Zyphra/Zamba2-7B
Pure PyTorch: https://github.com/Zyphra/Zamba2
Zyphra's team is committed to democratizing advanced AI systems, exploring novel architectures on the frontier of performance, and advancing the scientific study and understanding of powerful models. We look forward to collaborating with others who share our vision.
Zamba2 performs exceptionally well on standard language modeling evaluation sets, especially given its latency and generation speed. Among small language models (≤8B), we lead the pack in both quality and performance.
Our model outperforms existing state-of-the-art models for the following reasons:
Due to the exceptional quality of our pretraining and annealing datasets, Zamba2-7B performs extremely well on a per-training-token basis, sitting comfortably above the curve traced out by competitor models.
Zamba2 performs exceptionally well on standard language modeling evaluation sets, especially given its latency and generation speed. Among small language models (≤8B), we lead the pack in both quality and performance.
Our model outperforms existing state-of-the-art models for the following reasons:
Due to the exceptional quality of our pretraining and annealing datasets, Zamba2-7B performs extremely well on a per-training-token basis, sitting comfortably above the curve traced out by competitor models.
Zamba2-7B utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba layers interleaved with one or more shared attention layers (one shared attention in Zamba1, two in Zamba2). This attention has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings of the input to this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared MLP to gain some additional expressivity in each block and allow each shared block to specialize slightly to its own unique position while keeping the additional parameter overhead small.
We achieve state-of-the-art inference efficiency, including latency, throughput and memory usage because:
Zamba2-7B was trained on 128 H100 GPUS for approximately 50 days using our internal training framework developed atop Megatron-LM. Zamba2-7B thus demonstrates that at the 7B scale the frontier is still reachable and surpassable with a small team and moderate budget.
Zamba2-7B is released under an open source license, allowing researchers, developers, and companies to leverage its capabilities. We invite the broader AI community to explore Zamba's unique architecture and continue pushing the boundaries of efficient foundation models.
Instruct Zamba2-7B: https://huggingface.co/Zyphra/Zamba2-7B-Instruct
Base Zamba2-7B: https://huggingface.co/Zyphra/Zamba2-7B
Pure PyTorch: https://github.com/Zyphra/Zamba2
Zyphra's team is committed to democratizing advanced AI systems, exploring novel architectures on the frontier of performance, and advancing the scientific study and understanding of powerful models. We look forward to collaborating with others who share our vision.
Zamba2 performs exceptionally well on standard language modeling evaluation sets, especially given its latency and generation speed. Among small language models (≤8B), we lead the pack in both quality and performance.
Our model outperforms existing state-of-the-art models for the following reasons:
Due to the exceptional quality of our pretraining and annealing datasets, Zamba2-7B performs extremely well on a per-training-token basis, sitting comfortably above the curve traced out by competitor models.
We achieve state-of-the-art inference efficiency, including latency, throughput and memory usage because:
Zamba2-7B utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba layers interleaved with one or more shared attention layers (one shared attention in Zamba1, two in Zamba2). This attention has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings of the input to this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared MLP to gain some additional expressivity in each block and allow each shared block to specialize slightly to its own unique position while keeping the additional parameter overhead small.
Zamba2-7B was trained on 128 H100 GPUS for approximately 50 days using our internal training framework developed atop Megatron-LM. Zamba2-7B thus demonstrates that at the 7B scale the frontier is still reachable and surpassable with a small team and moderate budget.
Zamba2-7B is released under an open source license, allowing researchers, developers, and companies to leverage its capabilities. We invite the broader AI community to explore Zamba's unique architecture and continue pushing the boundaries of efficient foundation models.
Instruct Zamba2-7B: https://huggingface.co/Zyphra/Zamba2-7B-Instruct
Base Zamba2-7B: https://huggingface.co/Zyphra/Zamba2-7B
Pure PyTorch: https://github.com/Zyphra/Zamba2
Zamba2 performs exceptionally well on standard language modeling evaluation sets, especially given its latency and generation speed. Among small language models (≤8B), we lead the pack in both quality and performance.
Our model outperforms existing state-of-the-art models for the following reasons:
Due to the exceptional quality of our pretraining and annealing datasets, Zamba2-7B performs extremely well on a per-training-token basis, sitting comfortably above the curve traced out by competitor models.
Zamba2-7B utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba layers interleaved with one or more shared attention layers (one shared attention in Zamba1, two in Zamba2). This attention has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings of the input to this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared MLP to gain some additional expressivity in each block and allow each shared block to specialize slightly to its own unique position while keeping the additional parameter overhead small.
We achieve state-of-the-art inference efficiency, including latency, throughput and memory usage because:
Zamba2-7B was trained on 128 H100 GPUS for approximately 50 days using our internal training framework developed atop Megatron-LM. Zamba2-7B thus demonstrates that at the 7B scale the frontier is still reachable and surpassable with a small team and moderate budget.
Zamba2-7B is released under an open source license, allowing researchers, developers, and companies to leverage its capabilities. We invite the broader AI community to explore Zamba's unique architecture and continue pushing the boundaries of efficient foundation models.
Instruct Zamba2-7B: https://huggingface.co/Zyphra/Zamba2-7B-Instruct
Base Zamba2-7B: https://huggingface.co/Zyphra/Zamba2-7B
Pure PyTorch: https://github.com/Zyphra/Zamba2
Zyphra's team is committed to democratizing advanced AI systems, exploring novel architectures on the frontier of performance, and advancing the scientific study and understanding of powerful models. We look forward to collaborating with others who share our vision.
Zamba2 performs exceptionally well on standard language modeling evaluation sets, especially given its latency and generation speed. Among small language models (≤8B), we lead the pack in both quality and performance.
Our model outperforms existing state-of-the-art models for the following reasons:
Due to the exceptional quality of our pretraining and annealing datasets, Zamba2-7B performs extremely well on a per-training-token basis, sitting comfortably above the curve traced out by competitor models.
Zamba2-7B utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba layers interleaved with one or more shared attention layers (one shared attention in Zamba1, two in Zamba2). This attention has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings of the input to this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared MLP to gain some additional expressivity in each block and allow each shared block to specialize slightly to its own unique position while keeping the additional parameter overhead small.
We achieve state-of-the-art inference efficiency, including latency, throughput and memory usage because:
Zamba2-7B was trained on 128 H100 GPUS for approximately 50 days using our internal training framework developed atop Megatron-LM. Zamba2-7B thus demonstrates that at the 7B scale the frontier is still reachable and surpassable with a small team and moderate budget.
Zamba2-7B is released under an open source license, allowing researchers, developers, and companies to leverage its capabilities. We invite the broader AI community to explore Zamba's unique architecture and continue pushing the boundaries of efficient foundation models.
Instruct Zamba2-7B: https://huggingface.co/Zyphra/Zamba2-7B-Instruct
Base Zamba2-7B: https://huggingface.co/Zyphra/Zamba2-7B
Pure PyTorch: https://github.com/Zyphra/Zamba2
Zyphra's team is committed to democratizing advanced AI systems, exploring novel architectures on the frontier of performance, and advancing the scientific study and understanding of powerful models. We look forward to collaborating with others who share our vision.
We present histograms depicting distribution of cluster sizes in all the datasets (see Fig. 7-11). Please, note that all the figures are in log-log scale. We see a significant drop in the number of clusters starting from the size of around 100. This drop is present both in DCLM and FineWeb-Edu2 (see Fig. 8 and 9 respectively), and most likely is explained by a combination of the deduplication strategy and quality when creating both datasets: DCLM deduplication was done individually within 10 shards, while FineWeb-Edu2 was deduplicated within every Common Crawl snapshot. We find that large clusters usually contain low quality material (repeated advertisements, license agreements templates, etc), so it’s not surprising that such documents were removed. Notably, DCLM still contained one cluster with the size close to 1 million documents, containing low quality documents seemingly coming from the advertisements (see Appendix).We find both Zyda-1and Dolma-CC contain a small amount of duplicates, which is expected, since both datasets were deduplicated globally by their authors. Remaining duplicates are likely false negatives from the initial deduplication procedure. Note, that distribution of duplicates clusters sizes of these two datasets (Fig. 10 and 11) don’t contain any sharp drops, but rather hyper exponentially decreases with cluster size.
Below is an example of the document from the largest cluster (~1M documents) of duplicates in DCLM (quality score 0.482627):
Is safe? Is scam?
Is safe for your PC?
Is safe or is it scam?
Domain is SafeSafe score: 1
The higher the number, the more dangerous the website.Any number higher than 1 means DANGER.
Positive votes:
Negative votes:
Vote Up Vote Down review
Have you had bad experience with Warn us, please!
Below one will find a few documents with different quality scores from DCLM coming from the same duplicates cluster. Quality score varies from ~0.2 to ~0.04.