mHC-lite: You Don't Need 20 Sinkhorn-Knopp Iterations

Yongyi Yang; Jianyang Gao

arXiv:2601.05732·cs.LG·January 12, 2026

TL;DR

mHC-lite reparameterizes the residual matrices of hyper-connections so that they are exactly doubly stochastic by construction, improving training stability and efficiency without specialized CUDA kernels.

Contribution

The paper proposes mHC-lite, a method that constructs doubly stochastic matrices explicitly, avoiding both the approximation error of finite Sinkhorn-Knopp iterations and the dependence on specialized CUDA kernels in the earlier mHC approach.

Findings

01. mHC-lite matches or exceeds mHC performance.

02. It achieves higher training throughput.

03. It eliminates residual instabilities in training.

Abstract

Hyper-Connections (HC) generalizes residual connections by introducing dynamic residual matrices that mix information across multiple residual streams, accelerating convergence in deep neural networks. However, unconstrained residual matrices can compromise training stability. To address this, DeepSeek's Manifold-Constrained Hyper-Connections (mHC) approximately projects these matrices onto the Birkhoff polytope via iterative Sinkhorn–Knopp (SK) normalization. We identify two limitations of this approach: (i) finite SK iterations do not guarantee exact doubly stochasticity, leaving an approximation gap that can accumulate through network depth and undermine stability; (ii) efficient SK implementation requires highly specialized CUDA kernels, raising engineering barriers and reducing portability. Motivated by the Birkhoff–von Neumann theorem, we propose mHC-lite, a simple…
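
The two ideas in the abstract can be illustrated with a small, stdlib-only Python sketch. It is not the authors' implementation: the abstract is truncated before it describes the actual mHC-lite parameterization. The sketch assumes the construction suggested by the Birkhoff-von Neumann theorem, which says that every doubly stochastic matrix is a convex combination of permutation matrices. The function names `sinkhorn_knopp`, `permutation_mixture`, and `max_sum_error`, the constant `DEMO_LOGITS`, and the choice of four residual streams are all hypothetical and chosen for illustration only.

```python
import itertools
import math

# Hypothetical 4x4 logits standing in for an unconstrained residual matrix.
DEMO_LOGITS = [
    [2.0, -1.0, 0.5, 0.0],
    [0.0, 1.5, -2.0, 1.0],
    [-1.0, 0.0, 2.5, -0.5],
    [1.0, -2.0, 0.0, 3.0],
]


def sinkhorn_knopp(logits, iters):
    """Alternate row and column normalization of exp(logits) for `iters` rounds."""
    n = len(logits)
    m = [[math.exp(x) for x in row] for row in logits]
    for _ in range(iters):
        # Row step: make every row sum to 1.
        m = [[x / sum(row) for x in row] for row in m]
        # Column step: make every column sum to 1 (this perturbs the row sums).
        col = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col[j] for j in range(n)] for i in range(n)]
    return m


def permutation_mixture(weight_logits, n):
    """Softmax over all n! permutations, then mix the permutation matrices.

    Assumed construction (Birkhoff-von Neumann), not necessarily the paper's.
    """
    perms = list(itertools.permutations(range(n)))
    if len(weight_logits) != len(perms):
        raise ValueError("need one logit per permutation")
    top = max(weight_logits)
    exps = [math.exp(w - top) for w in weight_logits]  # numerically stable softmax
    total = sum(exps)
    m = [[0.0] * n for _ in range(n)]
    for e, perm in zip(exps, perms):
        for i, j in enumerate(perm):
            m[i][j] += e / total  # each permutation adds its weight once per row/column
    return m


def max_sum_error(m):
    """Largest deviation of any row or column sum from 1."""
    n = len(m)
    rows = [abs(sum(m[i]) - 1.0) for i in range(n)]
    cols = [abs(sum(m[i][j] for i in range(n)) - 1.0) for j in range(n)]
    return max(rows + cols)


if __name__ == "__main__":
    for iters in (1, 5, 20):
        err = max_sum_error(sinkhorn_knopp(DEMO_LOGITS, iters))
        print(f"SK, {iters:2d} iterations: max row/column error = {err:.3e}")
    weights = [0.1 * k for k in range(math.factorial(4))]  # arbitrary logits
    err = max_sum_error(permutation_mixture(weights, 4))
    print(f"permutation mixture:  max row/column error = {err:.3e}")
```

Under these assumptions, the Sinkhorn-Knopp output is only approximately doubly stochastic after any finite number of rounds, while the permutation mixture is doubly stochastic up to floating-point rounding for any choice of weights. Enumerating all n! permutations is only practical for a small number of residual streams.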

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Adversarial Robustness in Machine Learning