Introducing ZR1-1.5B, a small but powerful math+code reasoning model

We introduce ZR1-1.5B, a small reasoning model trained extensively on both coding and mathematics problems with reinforcement learning. ZR1-1.5B outperforms many significantly larger general non-reasoning models on code generation, while maintaining performance close to state-of-the-art small reasoning models trained exclusively on math on competition-level evaluations.

Authors

Zyphra Team

Publication Date

April 10, 2025

Model

Introducing ZR1-1.5B, a small but powerful math+code reasoning model

We introduce ZR1-1.5B, a small reasoning model trained extensively on both coding and mathematics problems with reinforcement learning. ZR1-1.5B outperforms many significantly larger general non-reasoning models on code generation, while maintaining performance close to state-of-the-art small reasoning models trained exclusively on math on competition-level evaluations.

Beta Release of Zonos-v0.1

Zyphra is excited to release our beta Zonos-v0.1, a pair of extremely expressive text-to-speech (TTS) models with high fidelity voice cloning. We are releasing both a transformer and hybrid 1.6B TTS models with an Apache 2.0 license.

Beta Release of Zonos-v0.1

Zyphra is excited to release our beta Zonos-v0.1, a pair of extremely expressive text-to-speech (TTS) models with high fidelity voice cloning. We are releasing both a transformer and hybrid 1.6B TTS models with an Apache 2.0 license.

The Mixture-of-PageRanks Retriever for Long-Context Pre-Processing

In this post, we describe our Mixture-of-PageRanks RAG system, which is built to perform long-context tasks in a highly computationally efficient manner. We describe key features of the algorithm and the SOTA results it achieves across a variety of long-context benchmarks.

Authors

Nick Alonso, Beren Millidge

The Mixture-of-PageRanks Retriever for Long-Context Pre-Processing

In this post, we describe our Mixture-of-PageRanks RAG system, which is built to perform long-context tasks in a highly computationally efficient manner. We describe key features of the algorithm and the SOTA results it achieves across a variety of long-context benchmarks.

Authors

Nick Alonso, Beren Millidge

Frontier Training Kernels for Transformers (FA2) and SSMs (Mamba2) on AMD Instinct MI300X Accelerators

In this blog, we demonstrate the first backwards kernels to surpass H100s for both transformers (Flash Attention v2) and hybrid models (Mamba2), which enables training foundation models on AMD Instinct MI300x accelerators.

Frontier Training Kernels for Transformers (FA2) and SSMs (Mamba2) on AMD Instinct MI300X Accelerators

In this blog, we demonstrate the first backwards kernels to surpass H100s for both transformers (Flash Attention v2) and hybrid models (Mamba2), which enables training foundation models on AMD Instinct MI300x accelerators.

Reaching 1B Context Length with RAG

We demonstrate a retrieval system extending any off-the-shelf LLM to 1B (billion) context on a standard CPU during inference time. These preliminary results suggest our algorithm is a promising approach for performing long-context tasks especially in compute constrained scenarios (on device, cost-effective on-prem & cloud etc).

Authors

Nick Alonso, Beren Millidge

Reaching 1B Context Length with RAG

We demonstrate a retrieval system extending any off-the-shelf LLM to 1B (billion) context on a standard CPU during inference time. These preliminary results suggest our algorithm is a promising approach for performing long-context tasks especially in compute constrained scenarios (on device, cost-effective on-prem & cloud etc).

Authors

Nick Alonso, Beren Millidge

Building Zyda-2, a 5 Trillion Token High-Quality Dataset, with NVIDIA NeMo Curator

Zyphra is excited to release Zyda2, a 5-trillion token dataset composed of filtered and cross-deduplicated DCLM, FineWeb-Edu, Zyda-1, and Dolma v1.7's Common Crawl portion. Leveraging NVIDIA NeMo Curator, we've dramatically accelerated data processing from 3 weeks to 2 days while reducing costs.

Building Zyda-2, a 5 Trillion Token High-Quality Dataset, with NVIDIA NeMo Curator

Zyphra is excited to release Zyda2, a 5-trillion token dataset composed of filtered and cross-deduplicated DCLM, FineWeb-Edu, Zyda-1, and Dolma v1.7's Common Crawl portion. Leveraging NVIDIA NeMo Curator, we've dramatically accelerated data processing from 3 weeks to 2 days while reducing costs.

Zamba2-7B

Zyphra is excited to release Zamba2-7B, a state-of-the-art small language model. At the 7B scale, we outperform the leading models of Mistral, Google’s Gemma and Meta’s Llama3 series in both quality and performance. We believe Zamba2-7B is the leading model for running on-device and on consumer GPUs as well as for many enterprise applications which require a powerful but compact and efficient model for natural-language tasks.

Zamba2-7B

Zyphra is excited to release Zamba2-7B, a state-of-the-art small language model. At the 7B scale, we outperform the leading models of Mistral, Google’s Gemma and Meta’s Llama3 series in both quality and performance. We believe Zamba2-7B is the leading model for running on-device and on consumer GPUs as well as for many enterprise applications which require a powerful but compact and efficient model for natural-language tasks.

Zamba2-mini (1.2B)

Zyphra is excited to release Zamba2-mini, a state-of-the-art SLM for on-device applications. Zamba2-mini achieves highly competitive evaluation scores and performance numbers and fits in a tiny memory footprint of <700MB at 4bit quantization. Zamba2- mini (1.2B) ~ Llama2 7B

Authors

Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Beren Millidge

Zamba2-mini (1.2B)

Zyphra is excited to release Zamba2-mini, a state-of-the-art SLM for on-device applications. Zamba2-mini achieves highly competitive evaluation scores and performance numbers and fits in a tiny memory footprint of <700MB at 4bit quantization. Zamba2- mini (1.2B) ~ Llama2 7B

Authors

Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Beren Millidge

The Zyphra Training Cookbook

Training hybrid models is hard, and papers tend to gloss over the practical engineering work that goes into building good ones. The purpose of this cookbook is to enable other technical groups to hit the ground running when building their own hybrid (SSM, Transformer, MoE) models.

Authors

Quentin Anthony, Beren Millidge, Paolo Glorioso, and Yury Tokpanov

The Zyphra Training Cookbook

Training hybrid models is hard, and papers tend to gloss over the practical engineering work that goes into building good ones. The purpose of this cookbook is to enable other technical groups to hit the ground running when building their own hybrid (SSM, Transformer, MoE) models.

Authors

Quentin Anthony, Beren Millidge, Paolo Glorioso, and Yury Tokpanov

Understanding Graph-based RAG and Multi-Hop Question Answering

In this post, we discuss and illustrate the usefulness of graph-based RAG systems for multi-hop Question-Answering (QA) tasks. Multi-hop questions are those that require a chain of multiple retrieval steps to answer.

Authors

Authors: Nick Alonso, Beren Millidge

Understanding Graph-based RAG and Multi-Hop Question Answering

In this post, we discuss and illustrate the usefulness of graph-based RAG systems for multi-hop Question-Answering (QA) tasks. Multi-hop questions are those that require a chain of multiple retrieval steps to answer.

Authors

Authors: Nick Alonso, Beren Millidge

Edge LLMs: Benefits, Challenges, and Solutions

This blog post discusses the key factors to consider when deploying models on edge devices. We emphasize the significant hardware constraints of these devices, and identify techniques to efficiently utilize local hardware resources - quantization, low-rank adapters, and real-time parameter offloading from storage.

Authors

Andrew Greene, Kamil Rocki, Tomas Figliolia, Travis Oliphant, Beren Millidge

Edge LLMs: Benefits, Challenges, and Solutions

This blog post discusses the key factors to consider when deploying models on edge devices. We emphasize the significant hardware constraints of these devices, and identify techniques to efficiently utilize local hardware resources - quantization, low-rank adapters, and real-time parameter offloading from storage.

Authors

Andrew Greene, Kamil Rocki, Tomas Figliolia, Travis Oliphant, Beren Millidge

How investors can help solve AI’s energy problem

Investors need to think outside the box when it comes to addressing artificial intelligence’s energy problem.

How investors can help solve AI’s energy problem

Investors need to think outside the box when it comes to addressing artificial intelligence’s energy problem.

Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters

Zyphra is excited to announce Tree Attention, a novel method for efficiently parallelizing multi-GPU transformer decoding with significant advantages in speed and memory.

Authors

Vasudev Shyam, Jonathan Pilault, Emily Shepperd, Quentin Anthony, Beren Millidge

Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters

Zyphra is excited to announce Tree Attention, a novel method for efficiently parallelizing multi-GPU transformer decoding with significant advantages in speed and memory.

Authors

Vasudev Shyam, Jonathan Pilault, Emily Shepperd, Quentin Anthony, Beren Millidge

Zyphra Launches Zamba2

An Efficient And Faster Small Language Model

Zyphra Launches Zamba2

An Efficient And Faster Small Language Model

Zamba2-small (2.7B)

Zyphra is excited to release Zamba2-small, a 2.7B state-of-the-art (SOTA) small language model for on-device applications.

Authors

Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Beren Millidge

Zamba2-small (2.7B)

Zyphra is excited to release Zamba2-small, a 2.7B state-of-the-art (SOTA) small language model for on-device applications.

Authors

Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Beren Millidge

Zyphra's Zyda

A 1.3T language modeling dataset it claims outperforms Pile, C4, arxiv

Zyphra's Zyda

A 1.3T language modeling dataset it claims outperforms Pile, C4, arxiv

Zyphra Debuts Zyda

An LLM training dataset with 1.3T tokens

Zyphra Debuts Zyda

An LLM training dataset with 1.3T tokens

Zyda

Zyphra is pleased to announce Zyda, a 1.3T trillion-token open dataset for language modeling.

Authors

Yury Tokpanov, Beren Millidge, Paolo Glorioso, Jonathan Pilault, James Whittington, Quentin Anthony

Zyda

Zyphra is pleased to announce Zyda, a 1.3T trillion-token open dataset for language modeling.

Authors

Yury Tokpanov, Beren Millidge, Paolo Glorioso, Jonathan Pilault, James Whittington, Quentin Anthony

Toward Conversational Agents with Context and Time Sensitive Long-term Memory

Effective retrieval from long-form conversational data faces two unique problems compared to static database retrieval.

Authors

Nick Alonso, Tomás Figliolia, Anthony Ndirango, Beren Millidge

Toward Conversational Agents with Context and Time Sensitive Long-term Memory

Effective retrieval from long-form conversational data faces two unique problems compared to static database retrieval.

Authors

Nick Alonso, Tomás Figliolia, Anthony Ndirango, Beren Millidge

Zyphra Releases Zamba

An SSM-hybrid foundation model to bring AI to more devices

Zyphra Releases Zamba

An SSM-hybrid foundation model to bring AI to more devices

Zamba

Zyphra is proud to release Zamba, a novel 7B parameter foundation model.

Authors

Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, Beren Millidge

Zamba

Zyphra is proud to release Zamba, a novel 7B parameter foundation model.

Authors

Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, Beren Millidge

The Unreasonable Ineffectiveness of the Deeper Layers

We empirically study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until after a large fraction (up to half) of the layers are removed.

Authors

Paolo Glorioso, Beren Millidge

The Unreasonable Ineffectiveness of the Deeper Layers

We empirically study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until after a large fraction (up to half) of the layers are removed.

Authors

Paolo Glorioso, Beren Millidge

What To Make of OpenAI’s New Board

The Startup Tackling Karpathy’s Vision

What To Make of OpenAI’s New Board

The Startup Tackling Karpathy’s Vision

Zyphra Open-Sources BlackMamba

A Novel Architecture that Combines the Mamba SSM with MoE to Obtain the Benefits of Both

Zyphra Open-Sources BlackMamba

A Novel Architecture that Combines the Mamba SSM with MoE to Obtain the Benefits of Both

NeuraNoC - A neuroscience-inspired packet switch network-on-chip (NoC)

Zyphra’s NeuraNoC is a pioneering packet-switched network-on-chip (NoC), named for its routing mechanism that resembles the spiking behavior of neurons in the brain by encoding processor connections as Bernoulli processes.

NeuraNoC - A neuroscience-inspired packet switch network-on-chip (NoC)

Zyphra’s NeuraNoC is a pioneering packet-switched network-on-chip (NoC), named for its routing mechanism that resembles the spiking behavior of neurons in the brain by encoding processor connections as Bernoulli processes.

Zyphra Blog

Introducing ZR1-1.5B, a small but powerful math+code reasoning model

Beta Release of Zonos-v0.1

The Mixture-of-PageRanks Retriever for Long-Context Pre-Processing

Frontier Training Kernels for Transformers (FA2) and SSMs (Mamba2) on AMD Instinct MI300X Accelerators

Building Zyda-2, a 5 Trillion Token High-Quality Dataset, with NVIDIA NeMo Curator

Zamba2-7B

Reaching 1B Context Length with RAG

NeuraNoC - A neuroscience-inspired packet switch network-on-chip (NoC)

Zamba2-mini (1.2B)

The Zyphra Training Cookbook

Understanding Graph-based RAG and Multi-Hop Question Answering

Edge LLMs: Benefits, Challenges, and Solutions

The Unreasonable Ineffectiveness of the Deeper Layers

How investors can help solve AI’s energy problem

Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters

Toward Conversational Agents with Context and Time Sensitive Long-term Memory

Zyphra Launches Zamba2

Zyphra's Zyda

Zyphra Debuts Zyda

Zyphra Releases Zamba

What To Make of OpenAI’s New Board

Zyphra Open-Sources BlackMamba

Zamba2-small (2.7B)

Zyda

Zamba

Browse More News

Introducing ZR1-1.5B, a small but powerful math+code reasoning model

Introducing ZR1-1.5B, a small but powerful math+code reasoning model

Beta Release of Zonos-v0.1

Beta Release of Zonos-v0.1

The Mixture-of-PageRanks Retriever for Long-Context Pre-Processing

The Mixture-of-PageRanks Retriever for Long-Context Pre-Processing

Frontier Training Kernels for Transformers (FA2) and SSMs (Mamba2) on AMD Instinct MI300X Accelerators

Frontier Training Kernels for Transformers (FA2) and SSMs (Mamba2) on AMD Instinct MI300X Accelerators

Reaching 1B Context Length with RAG

Reaching 1B Context Length with RAG

Building Zyda-2, a 5 Trillion Token High-Quality Dataset, with NVIDIA NeMo Curator

Building Zyda-2, a 5 Trillion Token High-Quality Dataset, with NVIDIA NeMo Curator

Zamba2-7B

Zamba2-7B

Zamba2-mini (1.2B)

Zamba2-mini (1.2B)

The Zyphra Training Cookbook

The Zyphra Training Cookbook

Understanding Graph-based RAG and Multi-Hop Question Answering

Understanding Graph-based RAG and Multi-Hop Question Answering

Edge LLMs: Benefits, Challenges, and Solutions

Edge LLMs: Benefits, Challenges, and Solutions

How investors can help solve AI’s energy problem

How investors can help solve AI’s energy problem

Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters

Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters

Zyphra Launches Zamba2

Zyphra Launches Zamba2

Zamba2-small (2.7B)

Zamba2-small (2.7B)

Zyphra's Zyda

Zyphra's Zyda

Zyphra Debuts Zyda

Zyphra Debuts Zyda

Zyda

Zyda

Toward Conversational Agents with Context and Time Sensitive Long-term Memory

Toward Conversational Agents with Context and Time Sensitive Long-term Memory

Zyphra Releases Zamba

Zyphra Releases Zamba

Zamba

Zamba

The Unreasonable Ineffectiveness of the Deeper Layers

The Unreasonable Ineffectiveness of the Deeper Layers

What To Make of OpenAI’s New Board

What To Make of OpenAI’s New Board

Zyphra Open-Sources BlackMamba

Zyphra Open-Sources BlackMamba

NeuraNoC - A neuroscience-inspired packet switch network-on-chip (NoC)

NeuraNoC - A neuroscience-inspired packet switch network-on-chip (NoC)

No results found.