Block-Wise Differentiable Sinkhorn Attention: Tail-Refinement Gradients with a Gap-Aware Dustbin Bridge
Dylan Forde

TL;DR
This paper introduces a block-wise differentiable Sinkhorn attention method for TPU hardware. It targets long-context entropic optimal transport, differentiates a tail-refinement surrogate exactly, and reports improved performance on biological sequence tasks.
Contribution
It presents a novel tail-refinement surrogate for Sinkhorn attention that allows exact differentiation, along with theoretical analysis and practical implementation on TPU hardware.
Findings
Exact gradient computation matches autodiff to within 10^{-10}
Achieves 8.5 examples/sec on TPU for long sequences
Improves reconstruction and sparse CE metrics on Pfam datasets
Abstract
We study long-context balanced entropic optimal transport (OT) attention on TPU hardware through a stopped-base, fixed-depth tail-refinement surrogate. After a stopped -step Sinkhorn solve, we unroll a short refinement tail and differentiate that surrogate exactly. For the reported TPU path, the backward pass contains four staircase plan factors. We prove an exact one-reference-tile schedule: the score cotangent is a single reference plan tile times an explicit modifier field built from vector cotangents and dual differences. This yields block-wise cost , input storage, and additional HBM usage for fixed head dimension and band width on the balanced fixed-support path. We also formalize the current \texttt{dustbin\_block} path as the same unit-target surrogate on an augmented support, so the adjoint schedule lifts to the…
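The stopped-base, tail-refinement idea in the abstract can be illustrated with a minimal dense JAX sketch. It is not the paper's block-wise TPU kernel and does not model the dustbin path. The function names, the unit row and column targets on a square score matrix, and the values of `eps`, `base_steps`, and `tail_steps` are all assumptions chosen for illustration. The base solve runs without gradient tracking, and only the short unrolled tail is differentiated by autodiff.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp


def sinkhorn_step(f, g, scores, eps):
    """One log-domain Sinkhorn update with unit row/column targets (assumed)."""
    # Row potentials: make every row of the plan sum to 1.
    f = -eps * logsumexp((scores + g[None, :]) / eps, axis=1)
    # Column potentials: make every column of the plan sum to 1.
    g = -eps * logsumexp((scores + f[:, None]) / eps, axis=0)
    return f, g


def tail_refined_plan(scores, eps=0.5, base_steps=50, tail_steps=3):
    """Stopped base solve followed by a short differentiable tail (hypothetical helper)."""
    n, m = scores.shape
    # Base solve: gradients are cut at the input, so these iterations
    # contribute nothing to the backward pass.
    base_scores = jax.lax.stop_gradient(scores)

    def body(_, fg):
        f, g = fg
        return sinkhorn_step(f, g, base_scores, eps)

    f, g = jax.lax.fori_loop(0, base_steps, body, (jnp.zeros(n), jnp.zeros(m)))

    # Tail: a fixed number of unrolled steps that autodiff differentiates exactly.
    for _ in range(tail_steps):
        f, g = sinkhorn_step(f, g, scores, eps)

    # Transport plan from the scores and the refined potentials.
    return jnp.exp((scores + f[:, None] + g[None, :]) / eps)


def loss(scores, values):
    """Toy scalar objective on the attention output (illustrative only)."""
    plan = tail_refined_plan(scores)
    return jnp.sum((plan @ values) ** 2)


k1, k2 = jax.random.split(jax.random.PRNGKey(0))
scores = jax.random.normal(k1, (8, 8))
values = jax.random.normal(k2, (8, 4))

plan = tail_refined_plan(scores)
grad_scores = jax.grad(loss)(scores, values)
```

In this sketch the backward pass only unrolls `tail_steps` iterations, so its cost does not grow with `base_steps`. The resulting gradient is exact for the surrogate, not for the fully converged Sinkhorn solution.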
