Can an MLP Absorb Its Own Skip Connection?
Antonij Mijoski, Marko Karbevski

TL;DR
This paper investigates when the skip connection around a single-hidden-layer MLP can be absorbed into a residual-free MLP of the same width, and shows that the answer depends on the activation function and, for ReLU and GELU, on the weights.
Contribution
It characterizes when a skip connection can or cannot be absorbed into a residual-free MLP of the same width, and extends the impossibility results to deep networks.
Findings
Absorption is impossible for homogeneous activations of degree other than 1.
Absorption is also impossible for gated activations whose gate is differentiable at zero, such as SwiGLU and GeGLU.
For ungated ReLU and GELU, absorption is non-generic: it is possible only under specific conditions on the weights.
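The last finding can be illustrated with one hand-picked weight configuration for which absorption does work. The sketch below assumes bias-free layers and a hidden width of 2d, and uses the identity x = ReLU(x) - ReLU(-x). The matrices `W1`, `W2`, and `W2_absorbed` are hypothetical names chosen for this example; this is not the paper's general condition, only one special case.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
d = 3

# Assumed special configuration: the first layer stacks I and -I (width 2d).
W1 = np.vstack([np.eye(d), -np.eye(d)])   # shape (2d, d)
W2 = rng.standard_normal((d, 2 * d))      # arbitrary output weights

# Because x = relu(x) - relu(-x), the identity map equals [I, -I] @ relu(W1 @ x),
# so the skip connection folds into the output weights.
W2_absorbed = W2 + np.hstack([np.eye(d), -np.eye(d)])

x = rng.standard_normal((d, 5))           # five test inputs as columns
residual_out = x + W2 @ relu(W1 @ x)      # block with skip connection
plain_out = W2_absorbed @ relu(W1 @ x)    # residual-free block, same width

max_gap = float(np.max(np.abs(residual_out - plain_out)))
print(max_gap)
```

For a randomly drawn `W1` this construction is not available, which is consistent with absorption depending on the weights.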
Abstract
We study when a skip connection around a single-hidden-layer MLP can be absorbed into a residual-free MLP of the same width. We first show that for any architecture whose skip branch is an invertible linear map (including Hyper-Connections and their manifold-constrained variants), the problem reduces to the identity skip case. For homogeneous activations of degree $k \neq 1$, such as ReLU$^2$ and ReGLU, absorption is unconditionally impossible by a degree argument. For gated activations whose gate $g$ is differentiable at the origin with $g(0) = 0$, including SwiGLU and GeGLU, a linearization argument gives the same conclusion. These impossibility results extend to arbitrary depth: a composition of residual blocks using such activations cannot be replicated by any composition of residual-free blocks of the same width. For ungated ReLU and GELU, the situation is richer. For generic…
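The degree argument can be checked numerically. The sketch below assumes bias-free layers and uses squared ReLU, a homogeneous activation of degree 2 chosen here for illustration. A residual-free block f(x) = W2 σ(W1 x) then satisfies f(t x) = t² f(x) for t > 0, whereas the residual block x + f(x) leaves a gap of (t - t²) x, so the two cannot coincide. All names (`mlp`, `residual_block`, `W1`, `W2`) are hypothetical.

```python
import numpy as np

def relu_squared(z):
    # Homogeneous of degree 2: relu_squared(t * z) == t**2 * relu_squared(z) for t > 0.
    return np.maximum(z, 0.0) ** 2

def mlp(W1, W2, x):
    # Bias-free single-hidden-layer MLP (assumption of this sketch).
    return W2 @ relu_squared(W1 @ x)

def residual_block(W1, W2, x):
    return x + mlp(W1, W2, x)

rng = np.random.default_rng(0)
d, m = 4, 8
W1 = rng.standard_normal((m, d))
W2 = rng.standard_normal((d, m))
x = rng.standard_normal(d)
t = 3.0

# Residual-free block: scaling the input by t scales the output by t**2.
plain_gap = float(np.max(np.abs(mlp(W1, W2, t * x) - t**2 * mlp(W1, W2, x))))

# Residual block: the same test fails by exactly (t - t**2) * x.
residual_gap = float(np.max(np.abs(
    residual_block(W1, W2, t * x) - t**2 * residual_block(W1, W2, x))))
expected_gap = float(np.max(np.abs((t - t**2) * x)))

print(plain_gap, residual_gap, expected_gap)
```

Any residual-free block with this activation passes the scaling test for every choice of weights, so no choice of weights can reproduce the residual block's behavior.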
