
TL;DR
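This paper introduces Dynamic Latent Routing (DLR), a language-model post-training method that learns discrete latent codes and routing policies jointly with the model parameters. In low-data fine-tuning settings, DLR matches or outperforms supervised fine-tuning (SFT) across four datasets and six models, while prior discrete-latent baselines underperform SFT.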
Contribution
DLR is a new language-model post-training approach that jointly learns discrete latent codes, routing policies, and model parameters via dynamic search in a single stage.
Findings
DLR matches or outperforms supervised fine-tuning in low-data settings.
DLR achieves a mean gain of +6.6 percentage points over supervised fine-tuning across four datasets and six models.
Mechanistic analyses show DLR learns structured routing behaviors.
Abstract
We investigate the temporal concatenation of sub-policies in Markov Decision Processes (MDPs) with time-varying reward functions. We introduce General Dijkstra Search (GDS), and prove that globally optimal goal-reaching policies can be recovered through temporal composition of intermediate optimal sub-policies. Motivated by the "search, select, update" principle underlying GDS, we propose Dynamic Latent Routing (DLR), a language-model post-training method that jointly learns discrete latent codes, routing policies, and model parameters through dynamic search in a single training stage. In low-data fine-tuning settings, DLR matches or outperforms supervised fine-tuning across four datasets and six models, achieving a mean gain of +6.6 percentage points, while prior discrete-latent baselines consistently underperform SFT. Mechanistic analyses and targeted code ablations show that DLR…
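For readers unfamiliar with the "search, select, update" pattern, the sketch below shows it in classical Dijkstra shortest-path search on a small weighted graph. This is the textbook algorithm, not the paper's General Dijkstra Search, which the abstract describes only at a high level. The names `dijkstra`, `path_to`, and `TOY_GRAPH` are illustrative assumptions. The loosely related idea it illustrates is that the best path to the goal is assembled from best paths to intermediate nodes.

```python
import heapq

# Hypothetical toy graph: node -> {neighbor: non-negative edge cost}.
TOY_GRAPH = {
    "s": {"a": 1, "b": 4},
    "a": {"b": 2, "g": 6},
    "b": {"g": 1},
    "g": {},
}


def dijkstra(graph, source):
    """Classical Dijkstra; returns (cost-to-reach, parent pointer) per node."""
    dist = {source: 0.0}
    parent = {source: None}
    frontier = [(0.0, source)]  # min-heap keyed on cost so far
    settled = set()
    while frontier:
        # Select: pop the cheapest unsettled node from the frontier.
        cost, node = heapq.heappop(frontier)
        if node in settled:
            continue  # stale heap entry, a cheaper one was already settled
        settled.add(node)
        # Search: expand the selected node's outgoing edges.
        for nxt, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            # Update: keep the cheaper route and remember how we got there.
            if new_cost < dist.get(nxt, float("inf")):
                dist[nxt] = new_cost
                parent[nxt] = node
                heapq.heappush(frontier, (new_cost, nxt))
    return dist, parent


def path_to(parent, goal):
    """Follow parent pointers back from goal to the source."""
    path = []
    node = goal
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]
```

Calling `dijkstra(TOY_GRAPH, "s")` and then `path_to(parent, "g")` recovers the route `s, a, b, g`, whose prefix `s, a, b` is itself the cheapest route to `b`.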
