Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU Clusters
August 7, 2024
PALO ALTO, CALIFORNIA

Zyphra is excited to announce Tree Attention, a novel method for efficiently parallelizing multi-GPU transformer decoding with significant advantages in speed and memory. For instance, we estimate that at a sequence length of 1M tokens, Tree Attention can decode over 8x faster than existing Ring Attention while requiring at least 2x less communication volume. Moreover, Tree Attention's advantage over Ring Attention grows asymptotically with the number of devices, so the benefit increases dramatically on larger clusters.
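The intuition behind a tree-shaped reduction is that softmax attention can be assembled from per-chunk partial results through an associative combine of each chunk's output and its log-sum-exp, so the cross-device reduction can be arranged as a tree (allreduce) rather than a sequential ring pass. The snippet below is a minimal NumPy sketch of that combine, not our reference implementation; the function names, chunk size, and shapes are illustrative assumptions.

```python
import numpy as np

def chunk_attention(q, k, v):
    """Attention of queries q over one local chunk of keys/values.

    Returns the chunk-local softmax output and the log-sum-exp of the
    chunk's scores -- the two quantities that can be merged across chunks.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (n_q, n_k)
    m = scores.max(axis=-1, keepdims=True)         # subtract max for stability
    p = np.exp(scores - m)
    denom = p.sum(axis=-1, keepdims=True)
    out = (p @ v) / denom                          # softmax output over this chunk
    lse = (m + np.log(denom)).squeeze(-1)          # log-sum-exp of this chunk
    return out, lse

def merge(a, b):
    """Associative combine of two partial results (out, lse).

    Because this operation is associative, per-device partials can be
    reduced with a tree-shaped allreduce instead of a sequential pass.
    """
    out_a, lse_a = a
    out_b, lse_b = b
    lse = np.logaddexp(lse_a, lse_b)
    out = (np.exp(lse_a - lse)[..., None] * out_a
           + np.exp(lse_b - lse)[..., None] * out_b)
    return out, lse

# Toy check: splitting keys/values into chunks (as if sharded across
# devices) and merging the partials reproduces full attention exactly.
rng = np.random.default_rng(0)
n_q, n_k, d = 2, 64, 16
q = rng.normal(size=(n_q, d))
k = rng.normal(size=(n_k, d))
v = rng.normal(size=(n_k, d))

parts = [chunk_attention(q, k[i:i + 16], v[i:i + 16]) for i in range(0, n_k, 16)]
out, lse = parts[0]
for part in parts[1:]:
    out, lse = merge((out, lse), part)

scores = q @ k.T / np.sqrt(d)
ref = np.exp(scores - scores.max(axis=-1, keepdims=True))
ref = (ref / ref.sum(axis=-1, keepdims=True)) @ v
assert np.allclose(out, ref)
```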

For full details, please see our paper and reference code implementation.

Authors
Vasudev Shyam, Jonathan Pilault, Emily Shepperd, Quentin Anthony, Beren Millidge
Collaborators
Daniel A Roberts (Sequoia Capital & MIT), Andrey Gromov (Meta FAIR), Kushal Tirumala (Meta FAIR) and Hassan Shapourian (Cisco)