Revisiting Auxiliary Losses for Conditional Depth Routing: An Empirical Study

Qingwei Lin

arXiv:2604.17228·cs.LG·April 21, 2026

TL;DR

This study empirically compares how auxiliary losses interact in conditional depth routing and finds that util/rank score supervision, trained against an off-policy oracle label, can reduce both training speed and LM performance under the tested settings.

Contribution

The paper systematically evaluates two gate designs (G1 and G3) and the interactions among their auxiliary losses, showing how off-policy labels affect training dynamics in conditional depth models.

Findings

01. The G3 gate improves early optimization with util/rank supervision.

02. Removing util/rank improves LM performance and training speed.

03. The off-policy oracle label causes util/rank to be net-negative under current settings.

Abstract

Conditional depth execution routes a subset of tokens through a lightweight (cheap) FFN while the remaining tokens execute the standard full FFN at each controlled layer. The central difficulty is gate training: the gate decision must propagate through many layers before it influences the language modeling (LM) loss, so the resulting gradients are weak and noisy. Auxiliary losses are commonly stacked to stabilise training, yet the interactions among them -- particularly between a predictive auxiliary and explicit score supervision -- have not been systematically compared under controlled conditions. We evaluate two gate designs on a 157.5M-parameter decoder-only model with controller-only training, a 50% full-path budget, and 3-seed runs on a fineweb-edu subset. The MLP gate (G1) maps the current hidden state to a utility score; the JEPA-guided gate (G3) adds an action-conditional predictor…
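The routing step described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the function names (`mlp_gate`, `route_by_budget`, `conditional_depth_layer`), the tanh hidden layer, the top-k selection rule, and the toy FFNs are all assumptions chosen for illustration. The only details taken from the abstract are that an MLP gate maps a hidden state to a utility score and that 50% of tokens take the full path.

```python
import math
import random

def mlp_gate(hidden, w1, b1, w2, b2):
    """Hypothetical two-layer MLP gate: hidden vector -> scalar utility score."""
    pre = [sum(h * w for h, w in zip(hidden, row)) + b for row, b in zip(w1, b1)]
    act = [math.tanh(p) for p in pre]
    return sum(a * w for a, w in zip(act, w2)) + b2

def route_by_budget(scores, full_budget=0.5):
    """Return a mask where True means 'full FFN' and False means 'cheap FFN'.

    Assumption: the budget is enforced by keeping the top-k scores;
    the source does not state the selection rule.
    """
    k = int(full_budget * len(scores))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(order[:k])
    return [i in chosen for i in range(len(scores))]

def conditional_depth_layer(hiddens, scores, full_ffn, cheap_ffn, full_budget=0.5):
    """Apply the full FFN to routed tokens and the cheap FFN to the rest."""
    mask = route_by_budget(scores, full_budget)
    outputs = [full_ffn(h) if m else cheap_ffn(h) for h, m in zip(hiddens, mask)]
    return outputs, mask

# Toy setup with random gate weights and stand-in FFNs (all illustrative).
random.seed(0)
d_model, d_gate, n_tokens = 4, 8, 8
w1 = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_gate)]
b1 = [0.0] * d_gate
w2 = [random.gauss(0, 1) for _ in range(d_gate)]
hiddens = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_tokens)]
scores = [mlp_gate(h, w1, b1, w2, 0.0) for h in hiddens]

toy_full = lambda h: [2.0 * x for x in h]   # stand-in for the standard full FFN
toy_cheap = lambda h: list(h)               # stand-in for the lightweight FFN
outputs, mask = conditional_depth_layer(hiddens, scores, toy_full, toy_cheap)
```

The hard top-k selection in this sketch is not differentiable with respect to the gate scores, which is one way to see why the gate receives only weak, indirect signal from the LM loss and why auxiliary losses are stacked on top of it.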
