Block-Wise Differentiable Sinkhorn Attention: Tail-Refinement Gradients with a Gap-Aware Dustbin Bridge

Dylan Forde

arXiv:2605.08123·cs.LG·May 21, 2026


TL;DR

This paper introduces a block-wise differentiable Sinkhorn attention method for TPU hardware. It targets long-context balanced entropic optimal transport attention, differentiates a fixed-depth tail-refinement surrogate exactly, and reports improved reconstruction and sparse CE metrics on Pfam datasets.

Contribution

It presents a stopped-base, fixed-depth tail-refinement surrogate for Sinkhorn attention that is differentiated exactly, together with a proof of a one-reference-tile backward schedule and a TPU implementation.

Findings

01 Exact gradient computation matches autodiff to within $10^{-10}$

02 Achieves 8.5 examples/sec on TPU for long sequences

03 Improves reconstruction and sparse CE metrics on Pfam datasets

Abstract

We study long-context balanced entropic optimal transport (OT) attention on TPU hardware through a stopped-base, fixed-depth tail-refinement surrogate. After a stopped $T$-step Sinkhorn solve, we unroll a short refinement tail and differentiate that surrogate exactly. For the reported $R = 2$ TPU path, the backward pass contains four staircase plan factors. We prove an exact one-reference-tile schedule: the $R = 2$ score cotangent is a single reference plan tile times an explicit modifier field built from vector cotangents and dual differences. This yields block-wise cost $O((T + R) L W)$, $O(L d)$ input storage, and $O(L)$ additional HBM usage for fixed head dimension $d$ and band width $W$ on the balanced fixed-support path. We also formalize the current \texttt{dustbin\_block} path as the same unit-target surrogate on an augmented support, so the adjoint schedule lifts to the…
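The stopped-base, tail-refinement idea described in the abstract can be illustrated with a small dense JAX sketch. This is only an illustration under stated assumptions, not the paper's implementation: it works on a full $L \times L$ score matrix rather than a banded block-wise layout, it relies on ordinary autodiff instead of the one-reference-tile schedule, and it omits the dustbin path. The names `sinkhorn_step`, `tail_refined_plan`, and `loss`, the regularization value `eps`, the row-then-column sweep order, and the toy sizes are all hypothetical choices made for this example.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp


def sinkhorn_step(scores, g, eps):
    """One log-domain Sinkhorn sweep toward unit row and column sums.

    The plan is P_ij = exp((scores_ij + f_i + g_j) / eps).
    """
    # Row update: choose f so that every row of the plan sums to 1.
    f = -eps * logsumexp((scores + g[None, :]) / eps, axis=1)
    # Column update: choose g so that every column of the plan sums to 1.
    g = -eps * logsumexp((scores + f[:, None]) / eps, axis=0)
    return f, g


def tail_refined_plan(scores, eps=0.5, T=20, R=2):
    """Stopped T-step base solve followed by an R-step differentiable tail."""
    n = scores.shape[0]
    f = jnp.zeros(n)
    g = jnp.zeros(n)

    # Base solve: no gradient flows through these T sweeps (assumed reading
    # of "stopped" in the abstract).
    stopped_scores = jax.lax.stop_gradient(scores)
    for _ in range(T):
        f, g = sinkhorn_step(stopped_scores, g, eps)
    g = jax.lax.stop_gradient(g)

    # Refinement tail: R sweeps that autodiff differentiates exactly.
    for _ in range(R):
        f, g = sinkhorn_step(scores, g, eps)

    return jnp.exp((scores + f[:, None] + g[None, :]) / eps)


def loss(scores, values):
    # Toy objective on the attention output, used only to obtain a gradient.
    plan = tail_refined_plan(scores)
    return jnp.sum((plan @ values) ** 2)


key_scores, key_values = jax.random.split(jax.random.PRNGKey(0))
scores = jax.random.normal(key_scores, (6, 6))
values = jax.random.normal(key_values, (6, 3))

plan = tail_refined_plan(scores)
grad_scores = jax.grad(loss)(scores, values)
```

In this sketch the gradient with respect to `scores` comes only from the `R` tail sweeps and the final plan evaluation, so the memory held for the backward pass does not grow with `T`. Because each sweep ends with a column update, the column sums of `plan` equal 1 up to floating-point error, while the row sums are only approximately 1 at finite depth.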
