In this blog, we demonstrate the first backward kernels to surpass the H100 for both transformers (Flash Attention v2) and hybrid models (Mamba2), enabling foundation model training on AMD Instinct MI300X accelerators.
On paper, the AMD Instinct MI300X GPU accelerators contain some of the best hardware specifications on the market. The key hardware specs where the MI300X surpasses its main competitor, the NVIDIA H100 GPU, are High Bandwidth Memory (HBM) capacity and bandwidth.
The MI300X also has more compute hardware at its disposal, with significantly more compute units (CUs) than the H100 has streaming multiprocessors (SMs). While this translates into extremely high theoretical BFLOAT16 throughput, there are caveats in practice that we discuss below.
For exact comparisons, see the table below:
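As a rough, illustrative way to reason about spec sheets like this, a quick roofline-style calculation gives the arithmetic intensity (FLOPs per byte of HBM traffic) at which a kernel stops being bandwidth-bound and becomes compute-bound. The figures below are commonly quoted datasheet values that we assume purely for illustration:

```python
# Back-of-the-envelope roofline "ridge point" comparison.
# Spec numbers are approximate, commonly quoted datasheet values
# (assumed here for illustration only).
specs = {
    "MI300X":   {"bf16_tflops": 1307.0, "hbm_tb_per_s": 5.30, "hbm_gb": 192},
    "H100 SXM": {"bf16_tflops":  989.0, "hbm_tb_per_s": 3.35, "hbm_gb": 80},
}

for name, s in specs.items():
    # Arithmetic intensity (FLOPs per byte of HBM traffic) above which a
    # kernel is limited by compute rather than by memory bandwidth.
    ridge_flops_per_byte = s["bf16_tflops"] / s["hbm_tb_per_s"]
    print(f"{name}: {s['hbm_gb']} GB HBM, "
          f"ridge point ≈ {ridge_flops_per_byte:.0f} FLOPs/byte")
```

On these assumed numbers the MI300X becomes compute-bound at a lower arithmetic intensity than the H100, so memory-bound kernels stand to gain proportionally more from its bandwidth advantage; actually reaching either roofline, however, depends entirely on how well the kernels are written.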
Despite these impressive specs, however, the NVIDIA H100/H200 remains the most widely adopted choice for large-scale pretraining runs (with notable exceptions such as Google and Apple, which train primarily on TPUs).
This is primarily because, while the MI300X hardware is theoretically very powerful, realizing that performance in practice requires additional work, owing to the relative immaturity of the surrounding software stack compared to NVIDIA’s CUDA ecosystem.
Any potential MI300X user therefore needs to pull this theoretical performance out of the underlying hardware themselves, since many of the kernels at the core of the modern AI ecosystem were built and optimized with implicit (or explicit) assumptions about NVIDIA hardware.
What this means in practice is that many kernels for core pretraining operations either do not exist or are poorly optimized compared to their H100 counterparts, which negates the fundamental performance advantages of the MI300X hardware.
While AMD’s ROCm software stack provides HIPification tools that enable a degree of portability from CUDA to HIP, actually matching or exceeding the performance of a CUDA-optimized kernel in HIP requires significant additional, specialized work.
Kernels are split into a forward pass and a backward pass. The forward pass is significantly simpler than the backward pass, partly because memory-efficient backward passes recompute activations rather than storing them, on top of computing the gradients themselves. As a result, many AMD kernels support inference well but lack the optimized backward passes needed for training.
For instance, the ROCm port of Flash Attention v2 has a highly optimized forward pass that slightly beats the NVIDIA H100 in throughput, but the backward pass still needs further optimization.
These three properties:
- core pretraining kernels that are missing or under-optimized on ROCm relative to their CUDA counterparts,
- HIPification tools that give portability but not performance parity without significant specialized work, and
- forward/inference kernels that are mature while the backward kernels needed for training lag behind (illustrated below),

have led to the “train on H100s and infer on MI300X” strategy that many organizations have started to take today.
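To make the forward/backward asymmetry concrete, here is a minimal PyTorch sketch (our own illustration with hypothetical function names, not the kernel described later in this post) of single-head attention where the forward pass keeps only the output and the softmax log-sum-exp, and the backward pass must recompute the score and probability matrices before it can form any gradients:

```python
import math
import torch

def attn_forward(q, k, v):
    """Single-head attention forward: only O and the softmax log-sum-exp
    are kept for backward; the (seq x seq) attention matrix is discarded."""
    scale = 1.0 / math.sqrt(q.shape[-1])
    s = (q @ k.transpose(-1, -2)) * scale           # (seq, seq) scores
    lse = torch.logsumexp(s, dim=-1, keepdim=True)  # softmax statistics
    p = torch.exp(s - lse)                          # softmax probabilities
    o = p @ v
    return o, lse

def attn_backward(q, k, v, o, lse, do):
    """Backward pass: recomputes the scores and probabilities from Q, K and
    the saved log-sum-exp, then forms dQ, dK, dV."""
    scale = 1.0 / math.sqrt(q.shape[-1])
    s = (q @ k.transpose(-1, -2)) * scale           # recomputed scores
    p = torch.exp(s - lse)                          # recomputed probabilities
    dv = p.transpose(-1, -2) @ do
    dp = do @ v.transpose(-1, -2)
    d = (do * o).sum(dim=-1, keepdim=True)          # row-wise correction term
    ds = p * (dp - d)
    dq = (ds @ k) * scale
    dk = (ds.transpose(-1, -2) @ q) * scale
    return dq, dk, dv

# Quick check against autograd (small sizes, fp64 for tight tolerances).
torch.manual_seed(0)
q, k, v = (torch.randn(128, 64, dtype=torch.float64, requires_grad=True) for _ in range(3))
o, lse = attn_forward(q, k, v)
do = torch.randn_like(o)
o.backward(do)
dq, dk, dv = attn_backward(q.detach(), k.detach(), v.detach(), o.detach(), lse.detach(), do)
assert torch.allclose(dq, q.grad) and torch.allclose(dk, k.grad) and torch.allclose(dv, v.grad)
```

Even in this toy form, the backward pass repeats the most expensive forward matmul and then adds several more, which is a large part of why backward kernels take so much more tuning than their forward counterparts.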
At Zyphra, we want to push the boundaries of hardware efficiency so that we can continue training frontier foundation models, such as our Zamba2 series, at a lower cost than our competitors. This is where AMD solutions come into play.
Therefore, over the last few months we have been writing and optimizing the component backward kernels that our hybrid models require (Mamba2 and Flash Attention v2).
To discuss the Flash Attention v2 (FA2) kernel-writing process, we first need to cover the hardware properties of the MI300X and how they affect FA2.
In addition to fully restructuring the kernel, we baked in the following AMD-specific optimizations when writing the FA2 backward kernel:
The result of this optimization work is that the FA2 backward kernel on the MI300X achieves speedups of 1%, 2%, and 4% over the H100 at sequence lengths of 2k, 4k, and 8k, respectively. When investigated with the ROCm profiling suite, cache thrashing, data-movement cost, and compute unit (CU) utilization during the attention backward pass are all significantly improved relative to the initial MI300X port baseline. With this attention backward kernel in hand, attention-based models such as dense and MoE transformers can be trained efficiently on MI300X hardware and achieve high model FLOPs utilization (MFU).
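For context on how numbers like these are typically gathered, the sketch below shows one way to time the attention backward pass across sequence lengths using device events in PyTorch. It uses torch.nn.functional.scaled_dot_product_attention as a stand-in kernel and hypothetical helper names; it is an illustrative harness, not the benchmark behind the figures above:

```python
import torch
import torch.nn.functional as F

def time_attention_backward(seqlen, batch=2, heads=16, headdim=128, iters=20):
    """Rough timing of the attention backward pass at a given sequence length.
    Uses scaled_dot_product_attention as a stand-in kernel; works on both
    CUDA and ROCm builds of PyTorch."""
    shape = (batch, heads, seqlen, headdim)
    q, k, v = (torch.randn(shape, device="cuda", dtype=torch.bfloat16,
                           requires_grad=True) for _ in range(3))
    grad_out = torch.randn(shape, device="cuda", dtype=torch.bfloat16)

    # One warm-up step so kernel selection/compilation is not timed.
    F.scaled_dot_product_attention(q, k, v, is_causal=True).backward(grad_out)
    q.grad = k.grad = v.grad = None

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    total_ms = 0.0
    for _ in range(iters):
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        torch.cuda.synchronize()                  # isolate the backward pass
        start.record()
        out.backward(grad_out)                    # the pass we care about
        end.record()
        torch.cuda.synchronize()
        total_ms += start.elapsed_time(end)
        q.grad = k.grad = v.grad = None           # reset grads between iters
    return total_ms / iters

for seqlen in (2048, 4096, 8192):
    print(f"seqlen={seqlen}: {time_attention_backward(seqlen):.2f} ms per backward")
```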
While the FlashAttention2 backward kernel is fundamental to training any transformer model today, at Zyphra we have also been innovating on novel model architectures that achieve both higher quality and significantly better inference and training performance than vanilla transformers.
Specifically, we have been designing and optimizing architectures built around hybrids of transformers and state-space models (SSMs) such as Mamba. We find that such hybrids significantly outperform transformers both in quality (loss per parameter) and in inference efficiency.
However, this architecture requires additional effort to port and optimize the Mamba2 kernels for the AMD ecosystem. Mamba2 kernels currently exist in Triton, but they are optimized for NVIDIA H100 GPUs. The same hardware considerations discussed above in the Flash Attention section also apply to Mamba2 backward kernel development.
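For readers unfamiliar with what these kernels actually compute, below is a naive, sequential PyTorch reference of the Mamba2 state-space recurrence for a single head, written to the best of our understanding of the Mamba2 formulation (hypothetical function name; a sketch, not the Triton kernels themselves). The production Triton kernels evaluate the same recurrence in a chunked, parallel form, and a slow reference like this is useful for checking a ported backward kernel's gradients against autograd on small shapes:

```python
import torch

def mamba2_ssd_reference(x, dt, A, B, C, D=None):
    """Naive sequential reference for the Mamba2 SSD recurrence (single head).

    x:  (batch, seqlen, headdim)   inputs to the SSM
    dt: (batch, seqlen)            positive step sizes (after softplus)
    A:  scalar tensor              per-head scalar, negative in Mamba2
    B:  (batch, seqlen, d_state)   input projection of the state
    C:  (batch, seqlen, d_state)   output projection of the state
    D:  optional scalar tensor     skip-connection scale
    """
    batch, seqlen, headdim = x.shape
    d_state = B.shape[-1]
    h = x.new_zeros(batch, d_state, headdim)          # SSM state
    ys = []
    for t in range(seqlen):
        decay = torch.exp(dt[:, t, None, None] * A)   # (batch, 1, 1)
        update = dt[:, t, None, None] * B[:, t, :, None] * x[:, t, None, :]
        h = decay * h + update                        # state recurrence
        y_t = torch.einsum("bn,bnp->bp", C[:, t], h)  # state readout
        if D is not None:
            y_t = y_t + D * x[:, t]
        ys.append(y_t)
    return torch.stack(ys, dim=1)                     # (batch, seqlen, headdim)
```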
We baked in the following AMD-specific optimizations when writing the Mamba2 backward kernel:
Similar to FA2, the Mamba2 backward kernel achieves speedups of 4%, 5%, and 6% over the H100 at sequence lengths of 2k, 4k, and 8k, respectively, on the MI300X. Cache thrashing, data-movement cost, and compute unit (CU) utilization are all significantly improved. With the Mamba2 forward and backward kernels and the Flash Attention 2 backward kernel in hand, pure-SSM and hybrid attention/SSM models can be trained on MI300X hardware and achieve higher FLOPs per dollar than is possible on NVIDIA H100 systems.
As future work, we plan to extend the attention kernel and portions of the Mamba2 kernel to FP8 precision, and to enable fine-grained overlap of tensor-parallel communication with computation inside the Mamba2, attention, and MLP blocks. Both optimizations are critical to Zyphra’s training pipeline.
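As an illustration of what fine-grained overlap means here, the toy sketch below (our own simplified example with hypothetical names, not Zyphra's implementation) launches an all-gather of sharded activations asynchronously and performs independent computation while the collective is in flight, using standard torch.distributed primitives that work over both NCCL and RCCL:

```python
import torch
import torch.distributed as dist

def overlapped_block(x_local, w_local, w_full):
    """Toy example of overlapping communication with computation.

    x_local: (tokens / tp_size, hidden) local shard of activations
    w_local: weight for work that only needs the local shard
    w_full:  weight applied to the fully gathered activations
    """
    tp_size = dist.get_world_size()
    local_tokens, hidden = x_local.shape
    gathered = x_local.new_empty(local_tokens * tp_size, hidden)

    # Launch the all-gather asynchronously along the token dimension;
    # async_op=True returns a work handle while the collective runs
    # in the background.
    work = dist.all_gather_into_tensor(gathered, x_local, async_op=True)

    # Compute something that only needs the local shard while the
    # all-gather is in flight.
    partial = x_local @ w_local

    # Block only at the point where the gathered activations are needed.
    work.wait()
    return partial, gathered @ w_full
```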
Zyphra would like to thank TensorWave, Cirrascale, and AMD for providing us with access to AMD Instinct MI300X GPU accelerators to carry out the optimizations in this work.
On paper, the AMD Instinct MI300X GPU accelerators contain some of the best hardware specifications on the market. The key hardware specs where the MI300X surpasses its main competitor, the NVIDIA H100 GPU, are High Bandwidth Memory (HBM) capacity and bandwidth.
The MI300X also has more compute hardware at its disposal, with a significantly greater number of streaming multiprocessors (SMs) than the H100. While this leads to incredibly high theoretical BFLOAT16 throughput, there are some caveats in practice that we discuss below.
For exact comparisons, see the table below:
Despite these impressive specs, however, the Nvidia H100/H200 remains widely adopted for large-scale pretraining runs (except Google and Apple, who use TPUs).
This is primarily because, while the MI300X hardware is theoretically very powerful, actually materializing that performance in practice needs additional work due to the nascency of the surrounding software stack compared to NVIDIA’s CUDA ecosystem.
The need for any potential MI300X user is to pull this theoretical performance out of the underlying hardware, since many of the kernels inherent to the modern AI ecosystem were built and optimized with implicit (or explicit) assumptions around NVIDIA hardware.
What this means in practice is that many kernels for core operations required in pretraining either do not exist, or are poorly optimized compared to their H100 counterparts, and this negates the fundamental performance advantages of the MI300x hardware.
While AMDs ROCm software stack provides HIPification tools which enable a degree of portability from CUDA to HIP, actually matching or exceeding the performance of a CUDA-optimized kernel in HIP requires significant additional and specialized work.
Kernels are split into the forward pass and the backward pass. The forward pass is significantly simpler than backward, which is partially due to activation recomputation. Many AMD kernels support inference but may lack optimized backward passes needed for training.
For instance, the ROCm port of Flash Attention v2 has a highly optimized implementation of the forward pass that slightly beats the NVIDIA H100 in throughput, but the backwards pass still needs more optimization.
These three properties:
have led to the “train on H100s and infer on MI300X” strategy that many organizations have started to take today.
At Zyphra, we want to push the boundaries of hardware efficiency so that we can continue training frontier foundation models such as our Zamba2 series of models at a lower cost than our competitors. This is where AMD solutions come into play.
Therefore, we have been working over the last few months to write and optimize the component backwards kernels that are necessary for our hybrid models (Mamba2 and Flash Attention v2).
To discuss the Flash Attention v2 (FA2) kernel writing process we need to discuss the hardware properties of MI300X and how they affect FA2.
In addition to the full restructuring, we baked in the following AMD-specific optimizations when writing the FA2 backward kernel:
The result of this optimization work is that we achieved speedups of 1%, 2%, and 4% for sequence lengths of 2k, 4k, and 8k, respectively for the FA2 backward kernel on MI300X compared to the H100. Cache thrashing, data movement cost, and SM utilization are all significantly improved compared to the initial MI300X port baseline during the attention backward pass when investigated via the ROCm profiling suite. With our attention backwards kernel in-hand, attention-based models such as dense/MoE transformers are trainable efficiently on MI300X hardware and can achieve high MFU.
While FlashAttention2-backward is a fundamental kernel necessary for training any transformer models today, at Zyphra we also have been innovating on designing novel model architectures which can achieve both higher quality and significantly greater inference and training performance over vanilla transformers.
Specifically, we have been designing and optimizing architectures based around hybrids of transformers and state-space-models (SSMs) such as Mamba. We find that such hybrids significantly outperform transformers both in terms of quality (loss per parameter) and also inference efficiency.
However, this architecture necessitates additional effort to port and optimize Mamba-2 kernels to the AMD ecosystem. Mamba2 kernels currently exist in Triton but are optimized for NVIDIA H100 GPUs. The same hardware properties we discussed above in the Flash Attention section hold for Mamba-2 backward kernel development.
We baked in the following AMD-specific optimizations when writing the Mamba-2 backward kernel:
Similar to FA2, we achieve speedups on the Mamba2 backward kernel of 4%, 5%, and 6% for sequence lengths of 2k, 4k, and 8k, respectively on MI300X compared to the H100. Cache thrashing, data movement cost, and SM utilization are all significantly improved. With both the Mamba2 forwards and backwards and the Flash Attention 2 backwards kernel in-hand, pure-SSM and hybrid attention/SSM models are trainable on MI300X hardware, and can achieve higher FLOPs per dollar than is possible on NVIDIA H100 systems.
As future work, we plan to extend the attention kernel and portions of the Mamba2 kernel to fp8 precision, and enable fine-grained tensor-parallel overlap within the Mamba2, Attention, and MLP blocks with communication. Both optimizations are critical to Zyphra’s training pipeline.
Zyphra would like to thank TensorWave, Cirrascale, and AMD for providing us with access to AMD Instinct MI300X GPU accelerators to carry out the optimizations in this work.
On paper, the AMD Instinct MI300X GPU accelerators contain some of the best hardware specifications on the market. The key hardware specs where the MI300X surpasses its main competitor, the NVIDIA H100 GPU, are High Bandwidth Memory (HBM) capacity and bandwidth.
The MI300X also has more compute hardware at its disposal, with a significantly greater number of streaming multiprocessors (SMs) than the H100. While this leads to incredibly high theoretical BFLOAT16 throughput, there are some caveats in practice that we discuss below.
For exact comparisons, see the table below:
Despite these impressive specs, however, the Nvidia H100/H200 remains widely adopted for large-scale pretraining runs (except Google and Apple, who use TPUs).
This is primarily because, while the MI300X hardware is theoretically very powerful, actually materializing that performance in practice needs additional work due to the nascency of the surrounding software stack compared to NVIDIA’s CUDA ecosystem.
The need for any potential MI300X user is to pull this theoretical performance out of the underlying hardware, since many of the kernels inherent to the modern AI ecosystem were built and optimized with implicit (or explicit) assumptions around NVIDIA hardware.
What this means in practice is that many kernels for core operations required in pretraining either do not exist, or are poorly optimized compared to their H100 counterparts, and this negates the fundamental performance advantages of the MI300x hardware.
While AMDs ROCm software stack provides HIPification tools which enable a degree of portability from CUDA to HIP, actually matching or exceeding the performance of a CUDA-optimized kernel in HIP requires significant additional and specialized work.
Kernels are split into the forward pass and the backward pass. The forward pass is significantly simpler than backward, which is partially due to activation recomputation. Many AMD kernels support inference but may lack optimized backward passes needed for training.
For instance, the ROCm port of Flash Attention v2 has a highly optimized implementation of the forward pass that slightly beats the NVIDIA H100 in throughput, but the backwards pass still needs more optimization.
These three properties:
have led to the “train on H100s and infer on MI300X” strategy that many organizations have started to take today.
At Zyphra, we want to push the boundaries of hardware efficiency so that we can continue training frontier foundation models such as our Zamba2 series of models at a lower cost than our competitors. This is where AMD solutions come into play.
Therefore, we have been working over the last few months to write and optimize the component backwards kernels that are necessary for our hybrid models (Mamba2 and Flash Attention v2).
To discuss the Flash Attention v2 (FA2) kernel writing process we need to discuss the hardware properties of MI300X and how they affect FA2.
In addition to the full restructuring, we baked in the following AMD-specific optimizations when writing the FA2 backward kernel:
The result of this optimization work is that we achieved speedups of 1%, 2%, and 4% for sequence lengths of 2k, 4k, and 8k, respectively for the FA2 backward kernel on MI300X compared to the H100. Cache thrashing, data movement cost, and SM utilization are all significantly improved compared to the initial MI300X port baseline during the attention backward pass when investigated via the ROCm profiling suite. With our attention backwards kernel in-hand, attention-based models such as dense/MoE transformers are trainable efficiently on MI300X hardware and can achieve high MFU.
While FlashAttention2-backward is a fundamental kernel necessary for training any transformer models today, at Zyphra we also have been innovating on designing novel model architectures which can achieve both higher quality and significantly greater inference and training performance over vanilla transformers.
Specifically, we have been designing and optimizing architectures based around hybrids of transformers and state-space-models (SSMs) such as Mamba. We find that such hybrids significantly outperform transformers both in terms of quality (loss per parameter) and also inference efficiency.
However, this architecture necessitates additional effort to port and optimize Mamba-2 kernels to the AMD ecosystem. Mamba2 kernels currently exist in Triton but are optimized for NVIDIA H100 GPUs. The same hardware properties we discussed above in the Flash Attention section hold for Mamba-2 backward kernel development.
We baked in the following AMD-specific optimizations when writing the Mamba-2 backward kernel:
Similar to FA2, we achieve speedups on the Mamba2 backward kernel of 4%, 5%, and 6% for sequence lengths of 2k, 4k, and 8k, respectively on MI300X compared to the H100. Cache thrashing, data movement cost, and SM utilization are all significantly improved. With both the Mamba2 forwards and backwards and the Flash Attention 2 backwards kernel in-hand, pure-SSM and hybrid attention/SSM models are trainable on MI300X hardware, and can achieve higher FLOPs per dollar than is possible on NVIDIA H100 systems.
As future work, we plan to extend the attention kernel and portions of the Mamba2 kernel to fp8 precision, and enable fine-grained tensor-parallel overlap within the Mamba2, Attention, and MLP blocks with communication. Both optimizations are critical to Zyphra’s training pipeline.
Zyphra would like to thank TensorWave, Cirrascale, and AMD for providing us with access to AMD Instinct MI300X GPU accelerators to carry out the optimizations in this work.
On paper, the AMD Instinct MI300X GPU accelerators contain some of the best hardware specifications on the market. The key hardware specs where the MI300X surpasses its main competitor, the NVIDIA H100 GPU, are High Bandwidth Memory (HBM) capacity and bandwidth.
The MI300X also has more compute hardware at its disposal, with a significantly greater number of streaming multiprocessors (SMs) than the H100. While this leads to incredibly high theoretical BFLOAT16 throughput, there are some caveats in practice that we discuss below.
For exact comparisons, see the table below:
Despite these impressive specs, however, the Nvidia H100/H200 remains widely adopted for large-scale pretraining runs (except Google and Apple, who use TPUs).
This is primarily because, while the MI300X hardware is theoretically very powerful, actually materializing that performance in practice needs additional work due to the nascency of the surrounding software stack compared to NVIDIA’s CUDA ecosystem.
The need for any potential MI300X user is to pull this theoretical performance out of the underlying hardware, since many of the kernels inherent to the modern AI ecosystem were built and optimized with implicit (or explicit) assumptions around NVIDIA hardware.
What this means in practice is that many kernels for core operations required in pretraining either do not exist, or are poorly optimized compared to their H100 counterparts, and this negates the fundamental performance advantages of the MI300x hardware.
While AMDs ROCm software stack provides HIPification tools which enable a degree of portability from CUDA to HIP, actually matching or exceeding the performance of a CUDA-optimized kernel in HIP requires significant additional and specialized work.
Kernels are split into the forward pass and the backward pass. The forward pass is significantly simpler than backward, which is partially due to activation recomputation. Many AMD kernels support inference but may lack optimized backward passes needed for training.
For instance, the ROCm port of Flash Attention v2 has a highly optimized implementation of the forward pass that slightly beats the NVIDIA H100 in throughput, but the backwards pass still needs more optimization.
These three properties:
have led to the “train on H100s and infer on MI300X” strategy that many organizations have started to take today.
At Zyphra, we want to push the boundaries of hardware efficiency so that we can continue training frontier foundation models such as our Zamba2 series of models at a lower cost than our competitors. This is where AMD solutions come into play.
Therefore, we have been working over the last few months to write and optimize the component backwards kernels that are necessary for our hybrid models (Mamba2 and Flash Attention v2).
To discuss the Flash Attention v2 (FA2) kernel writing process we need to discuss the hardware properties of MI300X and how they affect FA2.
On paper, the AMD Instinct MI300X GPU accelerators contain some of the best hardware specifications on the market. The key hardware specs where the MI300X surpasses its main competitor, the NVIDIA H100 GPU, are High Bandwidth Memory (HBM) capacity and bandwidth.
The MI300X also has more compute hardware at its disposal, with a significantly greater number of streaming multiprocessors (SMs) than the H100. While this leads to incredibly high theoretical BFLOAT16 throughput, there are some caveats in practice that we discuss below.
For exact comparisons, see the table below:
Despite these impressive specs, however, the Nvidia H100/H200 remains widely adopted for large-scale pretraining runs (except Google and Apple, who use TPUs).
This is primarily because, while the MI300X hardware is theoretically very powerful, actually materializing that performance in practice needs additional work due to the nascency of the surrounding software stack compared to NVIDIA’s CUDA ecosystem.
The need for any potential MI300X user is to pull this theoretical performance out of the underlying hardware, since many of the kernels inherent to the modern AI ecosystem were built and optimized with implicit (or explicit) assumptions around NVIDIA hardware.
What this means in practice is that many kernels for core operations required in pretraining either do not exist, or are poorly optimized compared to their H100 counterparts, and this negates the fundamental performance advantages of the MI300x hardware.
While AMDs ROCm software stack provides HIPification tools which enable a degree of portability from CUDA to HIP, actually matching or exceeding the performance of a CUDA-optimized kernel in HIP requires significant additional and specialized work.
Kernels are split into the forward pass and the backward pass. The forward pass is significantly simpler than backward, which is partially due to activation recomputation. Many AMD kernels support inference but may lack optimized backward passes needed for training.
For instance, the ROCm port of Flash Attention v2 has a highly optimized implementation of the forward pass that slightly beats the NVIDIA H100 in throughput, but the backwards pass still needs more optimization.
These three properties:
have led to the “train on H100s and infer on MI300X” strategy that many organizations have started to take today.
At Zyphra, we want to push the boundaries of hardware efficiency so that we can continue training frontier foundation models such as our Zamba2 series of models at a lower cost than our competitors. This is where AMD solutions come into play.
Therefore, we have been working over the last few months to write and optimize the component backwards kernels that are necessary for our hybrid models (Mamba2 and Flash Attention v2).
To discuss the Flash Attention v2 (FA2) kernel writing process we need to discuss the hardware properties of MI300X and how they affect FA2.
In addition to the full restructuring, we baked in the following AMD-specific optimizations when writing the FA2 backward kernel:
The result of this optimization work is that we achieved speedups of 1%, 2%, and 4% for sequence lengths of 2k, 4k, and 8k, respectively for the FA2 backward kernel on MI300X compared to the H100. Cache thrashing, data movement cost, and SM utilization are all significantly improved compared to the initial MI300X port baseline during the attention backward pass when investigated via the ROCm profiling suite. With our attention backwards kernel in-hand, attention-based models such as dense/MoE transformers are trainable efficiently on MI300X hardware and can achieve high MFU.
While FlashAttention2-backward is a fundamental kernel necessary for training any transformer models today, at Zyphra we also have been innovating on designing novel model architectures which can achieve both higher quality and significantly greater inference and training performance over vanilla transformers.
Specifically, we have been designing and optimizing architectures based around hybrids of transformers and state-space-models (SSMs) such as Mamba. We find that such hybrids significantly outperform transformers both in terms of quality (loss per parameter) and also inference efficiency.
However, this architecture necessitates additional effort to port and optimize Mamba-2 kernels to the AMD ecosystem. Mamba2 kernels currently exist in Triton but are optimized for NVIDIA H100 GPUs. The same hardware properties we discussed above in the Flash Attention section hold for Mamba-2 backward kernel development.
We baked in the following AMD-specific optimizations when writing the Mamba-2 backward kernel:
Similar to FA2, we achieve speedups on the Mamba2 backward kernel of 4%, 5%, and 6% for sequence lengths of 2k, 4k, and 8k, respectively on MI300X compared to the H100. Cache thrashing, data movement cost, and SM utilization are all significantly improved. With both the Mamba2 forwards and backwards and the Flash Attention 2 backwards kernel in-hand, pure-SSM and hybrid attention/SSM models are trainable on MI300X hardware, and can achieve higher FLOPs per dollar than is possible on NVIDIA H100 systems.
As future work, we plan to extend the attention kernel and portions of the Mamba2 kernel to fp8 precision, and enable fine-grained tensor-parallel overlap within the Mamba2, Attention, and MLP blocks with communication. Both optimizations are critical to Zyphra’s training pipeline.
Zyphra would like to thank TensorWave, Cirrascale, and AMD for providing us with access to AMD Instinct MI300X GPU accelerators to carry out the optimizations in this work.