DASH: Deterministic Attention Scheduling for High-throughput Reproducible LLM Training
Xinwei Qiang, Hongmin Chen, Shixuan Sun, Jingwen Leng, Xin Liu, Minyi Guo

TL;DR
DASH introduces scheduling strategies that optimize the deterministic attention backward pass, reducing its performance overhead while preserving reproducibility in large language model training.
Contribution
It formulates the deterministic attention backward pass as a DAG scheduling problem and proposes DASH with two strategies to improve throughput.
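To make the DAG framing concrete, here is a minimal sketch of what "critical path length" measures. The task names, durations, and dependency edges are hypothetical and chosen only for illustration; they are not the paper's model of the attention backward pass.

```python
from graphlib import TopologicalSorter


def critical_path_length(durations, deps):
    """Return the length of the longest duration-weighted path in a DAG.

    durations: {task: time}; deps: {task: set of tasks it must wait for}.
    """
    graph = {task: deps.get(task, set()) for task in durations}
    finish = {}
    # static_order() yields each task only after all of its predecessors.
    for task in TopologicalSorter(graph).static_order():
        start = max((finish[p] for p in graph[task]), default=0)
        finish[task] = start + durations[task]
    return max(finish.values(), default=0)


# Hypothetical toy workload: two compute tiles, each followed by a reduction.
durations = {"compute_a": 2, "compute_b": 2, "reduce_a": 1, "reduce_b": 1}

# Deterministic variant: reduce_b must wait for reduce_a (fixed order).
ordered = {"reduce_a": {"compute_a"}, "reduce_b": {"compute_b", "reduce_a"}}
# Unordered variant: each reduction only waits for its own compute tile.
unordered = {"reduce_a": {"compute_a"}, "reduce_b": {"compute_b"}}

print(critical_path_length(durations, ordered))    # 4
print(critical_path_length(durations, unordered))  # 3
```

In this toy case the extra ordering edge lengthens the critical path from 3 to 4 time units. Choosing which tasks run where, and in what order, so that such edges add as little as possible is the kind of problem a critical-path-minimizing schedule addresses.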
Findings
Up to 1.28× throughput improvement over the baseline
Narrows the performance gap between deterministic and non-deterministic attention
Enhances reproducibility without significant efficiency loss
Abstract
Determinism is indispensable for reproducibility in large language model (LLM) training, yet it often exacts a steep performance cost. In widely used attention implementations such as FlashAttention-3, the deterministic backward pass can incur up to a 37.9% throughput reduction relative to its non-deterministic counterpart, primarily because gradient accumulation operations must be serialized to guarantee numerical consistency. This performance loss stems from suboptimal scheduling of compute and gradient-reduction phases, leading to significant hardware underutilization. To address this challenge, we formulate the backward pass of deterministic attention as a scheduling problem on a Directed Acyclic Graph (DAG) and derive schedules that minimize the critical path length. Building on this formulation, we present DASH (Deterministic Attention Scheduling for High-Throughput), which…
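The reason accumulation order matters is that floating-point addition is not associative. The sketch below illustrates this with plain Python floats; the `accumulate` helper is hypothetical and stands in for a serialized gradient reduction, not for any FlashAttention-3 code.

```python
def accumulate(values, order):
    """Sum values in the given index order, as a serialized reduction would."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


values = [0.1, 0.2, 0.3]

left_to_right = accumulate(values, [0, 1, 2])  # (0.1 + 0.2) + 0.3
right_to_left = accumulate(values, [2, 1, 0])  # (0.3 + 0.2) + 0.1

print(left_to_right)                   # 0.6000000000000001
print(right_to_left)                   # 0.6
print(left_to_right == right_to_left)  # False
```

If the order in which partial gradients are added can vary between runs, the results can differ in the last bits, and those differences can compound over training steps. Fixing the order removes that variation, at the cost of making later additions wait for earlier ones.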
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper addresses an important area of deterministic training that has been gaining traction, especially in large model training. It bridges the gap between the non-deterministic and deterministic versions of the attention kernel, a fundamental component of the commonly used transformer architecture.
2. I especially like the in-depth analysis of when the theoretical model deviates from the on-hardware execution results: both for scenarios with long context
My only concern is that, given the motivating backward schedule analysis presented in Section 3.2, I would have expected the deterministic attention baseline to be more competitive with shift scheduling in the non-causal mask scenario. Similarly, I would have expected the descending schedule to be much better than the baseline for the causal mask case with head size 64. That does not seem to be the case. Would it be possible for the authors to specify what might be the cause
- Novel kernel implementation for an important operation in deterministic Transformer training.
- Clear and novel DAG-based formalization of deterministic backward scheduling.
- The two scheduling strategies are well-motivated, combining theory and practicality.
- Empirical validation on modern GPUs with thorough analysis of full vs. causal masks.
- Addresses an important reproducibility issue in large-scale deterministic training.
- Performance degrades at long sequence lengths with a full attention mask (Figure 8).
- The theoretical model ignores some GPU implementation considerations, such as inter-SM communication overhead and register requirements (as mentioned in Sections 4.2 and 4.3).
- Focuses solely on the backward pass; potential extensions to the forward pass are not explored.
- Symmetric Shift Scheduling introduces significant register pressure, limiting practical benefits.
Thank you for submitting your work.
- I'd like to praise the authors for their presentation. This was a pleasure to read. I was astonished how such a dense topic was explained so well in the text.
- FlashAttention 3 is a strong baseline, and the NVIDIA H800 GPU serves as a typical setup.
- The ideas are sound and the reason they occur is intuitive, though less so for shift scheduling.
- Though a minor point, I think there is merit to discussing the effect of determinism in other operations of the transformer. Matrix multiplications include reductions as well, correct? Does determinism reduce performance there as well?
- I think the paper should include a table that shows what the end-to-end relative benefit is for **a whole transformer block**, not just the attention part. While there is value in making attention faster, it is hard to put these gains in perspective without se
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning in Materials Science
