Revisiting Auxiliary Losses for Conditional Depth Routing: An Empirical Study
Qingwei Lin

TL;DR
This study empirically compares how auxiliary losses interact in conditional depth routing. It finds that util/rank supervision trained on an off-policy oracle label can slow training and hurt LM performance under the studied settings.
Contribution
The paper systematically evaluates two gate designs (G1 and G3) and their auxiliary loss interactions, showing how off-policy labels affect training dynamics in conditional depth models.
Findings
G3 gate improves early optimization with util/rank supervision.
Removing util/rank improves LM performance and training speed.
The off-policy oracle label causes util/rank to be net-negative under current settings.
Abstract
Conditional depth execution routes a subset of tokens through a lightweight cheap FFN while the remainder execute the standard full FFN at each controlled layer. The central difficulty is gate training: the gate decision must propagate through many layers before it influences the language modeling (LM) loss, so the resulting gradients are weak and noisy. Auxiliary losses are commonly stacked to stabilise training, yet the interactions among them -- particularly between a predictive auxiliary and explicit score supervision -- have not been systematically compared under controlled conditions. We evaluate two gate designs under a 157.5M-parameter decoder-only model with controller-only training, 50% full-path budget, and 3-seed runs on a fineweb-edu subset. The MLP gate (G1) maps the current hidden state to a utility score; the JEPA-guided gate (G3) adds an action-conditional predictor…
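The routing step the abstract describes can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear gate (standing in for the MLP gate G1), the two-layer ReLU `full_ffn`, the single-matrix `cheap_ffn`, the top-k selection rule, and all shapes and weights are assumptions chosen for illustration. It shows only how a per-token utility score and a 50% full-path budget split tokens between the two paths at one controlled layer.

```python
import numpy as np

def route_tokens(hidden, gate_w, full_ffn, cheap_ffn, budget=0.5):
    """Send the top-`budget` fraction of tokens (by gate score) through the
    full FFN and the rest through the cheap FFN. Hypothetical helper."""
    scores = hidden @ gate_w                      # one utility score per token
    n_full = int(round(budget * hidden.shape[0])) # tokens allowed on the full path
    order = np.argsort(-scores, kind="stable")    # highest score first
    mask = np.zeros(hidden.shape[0], dtype=bool)
    mask[order[:n_full]] = True                   # True = full path
    out = np.empty_like(hidden)
    out[mask] = full_ffn(hidden[mask])
    out[~mask] = cheap_ffn(hidden[~mask])
    return out, mask

# Toy setup: all sizes and weights are illustrative assumptions.
rng = np.random.default_rng(0)
d_model, d_hidden, n_tokens = 8, 32, 6
W1 = rng.normal(size=(d_model, d_hidden)) * 0.1
W2 = rng.normal(size=(d_hidden, d_model)) * 0.1
Wc = rng.normal(size=(d_model, d_model)) * 0.1
gate_w = rng.normal(size=d_model)

def full_ffn(x):
    # Standard two-layer FFN with ReLU.
    return np.maximum(x @ W1, 0.0) @ W2

def cheap_ffn(x):
    # Lightweight stand-in: a single linear map.
    return x @ Wc

hidden = rng.normal(size=(n_tokens, d_model))
out, mask = route_tokens(hidden, gate_w, full_ffn, cheap_ffn, budget=0.5)
print("tokens on full path:", int(mask.sum()), "of", n_tokens)
```

The hard top-k selection here is not differentiable with respect to the gate weights, which is one way to see why the gate needs some other training signal, such as the auxiliary losses the paper compares.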
