Fast Log-Domain Sinkhorn Optimal Transport with Warp-Level GPU Reductions

Hao Xiao

arXiv:2605.00837 · cs.LG · May 5, 2026

TL;DR

FastSinkhorn is a native CUDA implementation of the log-domain Sinkhorn algorithm for large-scale optimal transport. It stays numerically stable at small regularization and, on dense problems with n = m = 8192, runs 12x faster than the POT library and 5.9x faster than GPU-accelerated PyTorch baselines.

Contribution

The paper introduces a native CUDA implementation of the log-domain Sinkhorn algorithm. It combines warp-level shuffle reductions with shared-memory tiling for high GPU utilization, and works entirely in the log domain for numerical stability.
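
To illustrate what a warp-level reduction of a log-sum-exp could look like, here is a minimal Python emulation of an XOR-butterfly exchange across the 32 lanes of a warp. It is a sketch under stated assumptions, not the paper's kernel: the names `lse_combine` and `warp_logsumexp` are hypothetical, and the list indexing `lanes[i ^ offset]` only stands in for the register exchange that a shuffle instruction performs on the GPU.

```python
import math

WARP_SIZE = 32  # lanes per warp on NVIDIA GPUs


def lse_combine(a, b):
    """Merge two partial log-sum-exp values without overflow."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    hi, lo = (a, b) if a >= b else (b, a)
    # log(exp(hi) + exp(lo)) = hi + log(1 + exp(lo - hi)), with lo - hi <= 0
    return hi + math.log1p(math.exp(lo - hi))


def warp_logsumexp(lane_values):
    """Emulate an XOR-butterfly reduction over one warp (hypothetical helper)."""
    assert len(lane_values) == WARP_SIZE
    lanes = list(lane_values)
    offset = WARP_SIZE // 2
    while offset > 0:
        # Each lane reads the value held by lane (i XOR offset), as a shuffle would.
        lanes = [lse_combine(lanes[i], lanes[i ^ offset]) for i in range(WARP_SIZE)]
        offset //= 2
    # After log2(32) = 5 rounds every lane holds the full reduction (up to rounding).
    return lanes[0]


# Usage: reduce 32 strongly negative log-weights that would underflow if exponentiated.
values = [-1000.0 + 0.5 * i for i in range(WARP_SIZE)]
print(warp_logsumexp(values))
```

The reduction finishes in five exchange rounds with no shared-memory traffic inside the warp, which is the usual motivation for shuffle-based reductions. How the paper tiles rows across warps and blocks is not described in this summary.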

Findings

01. Achieves a 12x speedup over the POT library on dense OT problems with n = m = 8192

02. Remains numerically robust for regularization parameters as small as epsilon = 10^{-4}

03. Uses only 256 MB of GPU memory for large-scale problems
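
The small-epsilon finding is easier to appreciate with a concrete number. The sketch below compares a standard-domain row sum of the Gibbs kernel exp(-c / epsilon) with the same quantity computed in the log domain. The cost values and function names are hypothetical and chosen only for illustration; nothing here comes from the paper's code.

```python
import math


def kernel_row_sum(costs, eps):
    """Standard-domain row sum of the Gibbs kernel exp(-c / eps)."""
    return sum(math.exp(-c / eps) for c in costs)


def log_kernel_row_sum(costs, eps):
    """Log of the same row sum, computed with the max-shift trick."""
    scaled = [-c / eps for c in costs]
    shift = max(scaled)
    return shift + math.log(sum(math.exp(s - shift) for s in scaled))


costs = [0.3, 0.5, 0.9]  # hypothetical cost entries
eps = 1e-4

# exp(-3000) is far below the smallest double, so every term underflows to 0.0
# and a Sinkhorn scaling step would then divide by zero.
print(kernel_row_sum(costs, eps))

# The log-domain value stays finite (close to -3000 for these inputs).
print(log_kernel_row_sum(costs, eps))
```

The underflow shown here happens even in double precision. In single precision, which GPU kernels commonly use, it sets in at much larger values of epsilon.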

Abstract

Entropic regularized optimal transport (OT) via the Sinkhorn algorithm has become a fundamental tool in machine learning, yet existing implementations either suffer from numerical instability for small regularization parameters or incur significant overhead from deep learning frameworks. We present FastSinkhorn, a lightweight, native CUDA implementation of the log-domain Sinkhorn algorithm that combines warp-level shuffle reductions with shared-memory tiling to achieve high GPU utilization without sacrificing numerical stability. Our solver operates entirely in the log domain, enabling robust computation for regularization parameters as small as epsilon = 10^{-4}, where standard-domain methods fail. On dense OT problems with n = m = 8192, our implementation achieves a 12x speedup over the widely used POT library and a 5.9x speedup over GPU-accelerated PyTorch baselines, while consuming only…
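
For readers unfamiliar with the log-domain formulation, the following is a minimal CPU reference sketch of the textbook log-domain Sinkhorn iteration on dual potentials f and g. It is not FastSinkhorn and does not reflect its API: the function names, the fixed iteration count, and the pure-Python loops are assumptions made for clarity. Each inner `logsumexp` call is the kind of row or column reduction a GPU implementation would parallelize.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) using the max-shift trick."""
    shift = max(values)
    return shift + math.log(sum(math.exp(v - shift) for v in values))


def sinkhorn_log(a, b, C, eps, n_iters=200):
    """Textbook log-domain Sinkhorn; returns dual potentials (f, g)."""
    n, m = len(a), len(b)
    f = [0.0] * n
    g = [0.0] * m
    for _ in range(n_iters):
        # Row update: enforce the row marginals a.
        for i in range(n):
            f[i] = eps * (math.log(a[i])
                          - logsumexp([(g[j] - C[i][j]) / eps for j in range(m)]))
        # Column update: enforce the column marginals b.
        for j in range(m):
            g[j] = eps * (math.log(b[j])
                          - logsumexp([(f[i] - C[i][j]) / eps for i in range(n)]))
    return f, g


def transport_plan(f, g, C, eps):
    """Recover the plan P_ij = exp((f_i + g_j - C_ij) / eps)."""
    return [[math.exp((f[i] + g[j] - C[i][j]) / eps) for j in range(len(g))]
            for i in range(len(f))]


# Usage with epsilon = 1e-4 on a tiny hypothetical problem.
a, b = [0.5, 0.5], [0.5, 0.5]
C = [[0.0, 1.0], [1.0, 0.0]]
f, g = sinkhorn_log(a, b, C, eps=1e-4)
P = transport_plan(f, g, C, eps=1e-4)
print([sum(row) for row in P])
```

Because only differences such as (g_j - C_ij) / eps are exponentiated after subtracting their maximum, no intermediate quantity underflows to zero, which is what lets the iteration run at epsilon = 10^{-4}.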
