Fast Log-Domain Sinkhorn Optimal Transport with Warp-Level GPU Reductions
Hao Xiao

TL;DR
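FastSinkhorn is a native CUDA implementation of the log-domain Sinkhorn algorithm that targets high GPU utilization and numerical stability for large-scale optimal transport problems. On dense problems with n = m = 8192 it reports a 12x speedup over the POT library and a 5.9x speedup over GPU-accelerated PyTorch baselines.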
Contribution
The paper introduces a lightweight, native CUDA implementation of the log-domain Sinkhorn algorithm that combines warp-level shuffle reductions with shared-memory tiling to improve performance without sacrificing numerical stability.
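To make the reduction pattern concrete, here is a minimal pure-Python emulation of a warp-level shuffle-down tree reduction that computes a log-sum-exp over 32 lanes. It is only a sketch of the general technique, not the paper's kernel: the names `WARP_SIZE`, `combine`, and `warp_logsumexp` are hypothetical, and the list-based "shuffle" merely imitates what a CUDA intrinsic such as `__shfl_down_sync` does in registers.

```python
import math

WARP_SIZE = 32  # number of lanes in one CUDA warp


def combine(x, y):
    """Merge two partial results (running max m, sum of exp(v - m))."""
    (m1, s1), (m2, s2) = x, y
    m = max(m1, m2)
    if m == -math.inf:
        # Both partials are empty (all inputs were -inf).
        return (m, 0.0)
    # Rescale both partial sums to the shared maximum before adding.
    return (m, s1 * math.exp(m1 - m) + s2 * math.exp(m2 - m))


def warp_logsumexp(values):
    """Emulate a shuffle-down tree reduction over one warp of 32 values."""
    assert len(values) == WARP_SIZE
    # Each lane starts with its own value: max = v, scaled sum = exp(v - v) = 1.
    lanes = [(v, 1.0 if v != -math.inf else 0.0) for v in values]
    offset = WARP_SIZE // 2
    while offset > 0:
        # Lane i reads the register of lane i + offset (its own if out of range),
        # all lanes reading the values from before this step.
        shifted = [
            lanes[i + offset] if i + offset < WARP_SIZE else lanes[i]
            for i in range(WARP_SIZE)
        ]
        lanes = [combine(lanes[i], shifted[i]) for i in range(WARP_SIZE)]
        offset //= 2
    # After log2(32) = 5 steps, lane 0 holds the reduction over all lanes.
    m, s = lanes[0]
    return -math.inf if s == 0.0 else m + math.log(s)
```

Carrying a (max, scaled sum) pair through the reduction keeps every exponent non-positive, so the result stays finite even when the inputs are on the order of -C/epsilon for very small epsilon. On a GPU the same five steps would run in registers without a round trip through shared memory.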
Findings
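Achieves a 12x speedup over the POT library and a 5.9x speedup over GPU-accelerated PyTorch baselines on dense OT problems with n = m = 8192
Remains numerically stable for regularization parameters as small as epsilon = 10^{-4}, where standard-domain methods fail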
Uses only 256 MB GPU memory for large-scale problems
Abstract
Entropic regularized optimal transport (OT) via the Sinkhorn algorithm has become a fundamental tool in machine learning, yet existing implementations either suffer from numerical instability for small regularization parameters or incur significant overhead from deep learning frameworks. We present FastSinkhorn, a lightweight, native CUDA implementation of the log-domain Sinkhorn algorithm that combines warp-level shuffle reductions with shared-memory tiling to achieve high GPU utilization without sacrificing numerical stability. Our solver operates entirely in the log-domain, enabling robust computation for regularization parameters as small as epsilon = 10^{-4} where standard-domain methods fail. On dense OT problems with n = m = 8192, our implementation achieves 12x speedup over the widely-used POT library and 5.9x speedup over GPU-accelerated PyTorch baselines, while consuming only…
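As an illustration of why the log domain matters for small epsilon, the following is a minimal CPU sketch of log-domain Sinkhorn in pure Python. It is not the FastSinkhorn code or API: the function names `logsumexp` and `sinkhorn_log` are hypothetical, and the parameterization of the plan as P_ij = a_i b_j exp((f_i + g_j - C_ij) / epsilon) is one common convention, assumed here for the example.

```python
import math


def logsumexp(values):
    """Stable log(sum(exp(v))) via the max-shift trick."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def sinkhorn_log(cost, a, b, eps, n_iters=200):
    """Log-domain Sinkhorn iterations on dual potentials f and g.

    cost: n x m nested list, a: length-n weights, b: length-m weights
    (both strictly positive and summing to 1), eps: regularization strength.
    Returns (f, g, plan).
    """
    n, m = len(a), len(b)
    log_a = [math.log(x) for x in a]
    log_b = [math.log(x) for x in b]
    f = [0.0] * n
    g = [0.0] * m
    for _ in range(n_iters):
        # Row update: enforce sum_j P_ij = a_i without ever forming exp(-C/eps).
        f = [
            -eps * logsumexp([log_b[j] + (g[j] - cost[i][j]) / eps for j in range(m)])
            for i in range(n)
        ]
        # Column update: enforce sum_i P_ij = b_j.
        g = [
            -eps * logsumexp([log_a[i] + (f[i] - cost[i][j]) / eps for i in range(n)])
            for j in range(m)
        ]
    plan = [
        [
            math.exp(log_a[i] + log_b[j] + (f[i] + g[j] - cost[i][j]) / eps)
            for j in range(m)
        ]
        for i in range(n)
    ]
    return f, g, plan
```

With a cost of 1 and epsilon = 10^{-4}, a standard-domain solver would need exp(-10000), which underflows to zero in double precision and leads to divisions by zero in the scaling updates. The sketch above only exponentiates shifted differences, so a call such as `sinkhorn_log([[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5], [0.5, 0.5], eps=1e-4)` stays finite. Each logsumexp over a row or column is the kind of reduction a GPU implementation would parallelize.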
