mHC-lite: You Don't Need 20 Sinkhorn-Knopp Iterations
Yongyi Yang, Jianyang Gao

TL;DR
mHC-lite introduces a reparameterization for hyper-connections in neural networks that guarantees exact doubly stochastic matrices, improving training stability and efficiency without specialized hardware.
Contribution
It proposes mHC-lite, a novel method that constructs doubly stochastic matrices explicitly, avoiding approximation errors and hardware dependencies of previous approaches.
Findings
mHC-lite matches or exceeds mHC performance.
It achieves higher training throughput than mHC.
It eliminates residual instabilities in training.
Abstract
Hyper-Connections (HC) generalizes residual connections by introducing dynamic residual matrices that mix information across multiple residual streams, accelerating convergence in deep neural networks. However, unconstrained residual matrices can compromise training stability. To address this, DeepSeek's Manifold-Constrained Hyper-Connections (mHC) approximately projects these matrices onto the Birkhoff polytope via iterative Sinkhorn-Knopp (SK) normalization. We identify two limitations of this approach: (i) finite SK iterations do not guarantee exact doubly stochasticity, leaving an approximation gap that can accumulate through network depth and undermine stability; (ii) efficient SK implementation requires highly specialized CUDA kernels, raising engineering barriers and reducing portability. Motivated by the Birkhoff-von Neumann theorem, we propose mHC-lite, a simple…
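The contrast the abstract draws can be illustrated with a small sketch. It is not the authors' implementation: the abstract is truncated before it describes the construction, so the version below simply assumes the most direct reading of the Birkhoff-von Neumann theorem, namely a softmax-weighted convex combination of all n! permutation matrices. The function names (`sinkhorn_knopp`, `birkhoff_mix`, `ds_error`) and the choice of n = 4 streams are hypothetical and chosen for illustration only.

```python
import itertools
import math
import random


def sinkhorn_knopp(logits, n_iters):
    """Alternate row and column normalization of exp(logits) for n_iters rounds."""
    m = [[math.exp(x) for x in row] for row in logits]
    n = len(m)
    for _ in range(n_iters):
        # Row normalization: every row sums to 1 afterwards.
        m = [[x / sum(row) for x in row] for row in m]
        # Column normalization: every column sums to 1, which perturbs the rows again.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


def birkhoff_mix(weight_logits, n):
    """Softmax-weighted convex combination of all n! permutation matrices.

    Assumed construction (not confirmed by the source): any convex combination
    of permutation matrices is doubly stochastic, so no iteration is needed.
    """
    perms = list(itertools.permutations(range(n)))
    assert len(weight_logits) == len(perms)
    # Numerically stable softmax over the n! mixing weights.
    shift = max(weight_logits)
    exps = [math.exp(w - shift) for w in weight_logits]
    total = sum(exps)
    coeffs = [e / total for e in exps]
    m = [[0.0] * n for _ in range(n)]
    for c, perm in zip(coeffs, perms):
        for i, j in enumerate(perm):
            m[i][j] += c  # permutation matrix has a 1 at (i, perm[i])
    return m


def ds_error(m):
    """Largest deviation of any row or column sum from 1."""
    n = len(m)
    row_err = max(abs(sum(row) - 1.0) for row in m)
    col_err = max(abs(sum(m[i][j] for i in range(n)) - 1.0) for j in range(n))
    return max(row_err, col_err)


random.seed(0)
n = 4  # hypothetical number of residual streams
logits = [[random.gauss(0, 3) for _ in range(n)] for _ in range(n)]
sk_matrix = sinkhorn_knopp(logits, n_iters=3)
bvn_matrix = birkhoff_mix([random.gauss(0, 3) for _ in range(math.factorial(n))], n)

print("SK error after 3 iterations:", ds_error(sk_matrix))
print("Permutation-mix error:", ds_error(bvn_matrix))
```

In this sketch the SK result has exact column sums but only approximately correct row sums, while the permutation mix is doubly stochastic up to floating-point rounding. The n! term count is manageable only for a small number of streams, so this form is a sketch of the idea rather than a scalable recipe.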
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Adversarial Robustness in Machine Learning
