Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers
Shwai He, Tao Ge, Guoheng Sun, Bowei Tian, Xiaoyang Wang, Dong Yu

TL;DR
Router-Tuning introduces a lightweight fine-tuning method for dynamic-depth transformers, significantly reducing training costs and maintaining high performance while improving computational efficiency through a novel attention mechanism.
Contribution
The paper presents Router-Tuning, a novel approach that fine-tunes only the router component, and MindSkip, an attention mechanism that preserves performance with dynamic depths, addressing key challenges in MoD methods.
Findings
21% speedup in computation
Only 0.2% performance drop
Reduced training costs by fine-tuning routers
Abstract
Traditional transformer models often allocate a fixed amount of computational resources to every input token, leading to inefficient and unnecessary computation. To address this, the Mixture of Depths (MoD) was introduced to dynamically adjust the computational depth by skipping less important layers. Despite its promise, current MoD approaches remain under-explored and face two main challenges: (1) high training costs due to the need to train the entire model along with the routers that determine which layers to skip, and (2) the risk of performance degradation when important layers are bypassed. In response to the first issue, we propose Router-Tuning, a method that fine-tunes only the router on a small dataset, drastically reducing the computational overhead associated with full model training. For the second challenge, we propose MindSkip, which deploys Attention with Dynamic…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsModular Robots and Swarm Intelligence
