Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers

Shwai He; Tao Ge; Guoheng Sun; Bowei Tian; Xiaoyang Wang; Dong Yu

arXiv:2410.13184·cs.CL·September 15, 2025

TL;DR

Router-Tuning enables dynamic depth in transformers by fine-tuning only the routers on a small dataset, which sharply reduces training cost. A companion attention mechanism, MindSkip, preserves performance while improving computational efficiency.

Contribution

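The paper presents Router-Tuning, an approach that fine-tunes only the router component, and MindSkip, an attention mechanism with dynamic depths that preserves performance. Together they address the two main challenges of Mixture of Depths (MoD) methods: high training cost and performance degradation when important layers are skipped.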

Findings

01. 21% speedup in computation
02. Only 0.2% performance drop
03. Reduced training costs by fine-tuning routers

Abstract

Traditional transformer models often allocate a fixed amount of computational resources to every input token, leading to inefficient and unnecessary computation. To address this, the Mixture of Depths (MoD) was introduced to dynamically adjust the computational depth by skipping less important layers. Despite its promise, current MoD approaches remain under-explored and face two main challenges: (1) high training costs due to the need to train the entire model along with the routers that determine which layers to skip, and (2) the risk of performance degradation when important layers are bypassed. In response to the first issue, we propose Router-Tuning, a method that fine-tunes only the router on a small dataset, drastically reducing the computational overhead associated with full model training. For the second challenge, we propose MindSkip, which deploys Attention with Dynamic…
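
The two ideas in the abstract, a router that decides whether a layer runs and training that touches only that router, can be illustrated with a small pure-Python sketch. This is not the authors' implementation or the MindSkip mechanism itself: `Router`, `frozen_block`, `dynamic_depth_layer`, the mean-pooling step, the score-scaled residual, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class Router:
    """Hypothetical linear router. In a router-only tuning setup,
    `weight` and `bias` would be the only parameters updated."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.weight = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        self.bias = 0.0

    def score(self, hidden):
        # Assumption: mean-pool the tokens, then project to a scalar in (0, 1).
        dim = len(self.weight)
        pooled = [sum(tok[i] for tok in hidden) / len(hidden) for i in range(dim)]
        return sigmoid(sum(w * p for w, p in zip(self.weight, pooled)) + self.bias)


def frozen_block(hidden):
    """Stand-in for a frozen pretrained block; returns its residual update."""
    return [[0.5 * x for x in tok] for tok in hidden]


def dynamic_depth_layer(hidden, router, threshold=0.5):
    """Run the block only when the router score reaches the threshold."""
    s = router.score(hidden)
    if s < threshold:
        # Skipped: the input passes through on the residual path unchanged.
        return hidden, False
    update = frozen_block(hidden)
    # Assumption: scale the update by the router score before adding it.
    out = [[h + s * u for h, u in zip(tok, upd)] for tok, upd in zip(hidden, update)]
    return out, True


hidden = [[1.0, 2.0], [3.0, 4.0]]
router = Router(dim=2)
router.bias = -10.0  # force a low score so the block is skipped
skipped_out, ran = dynamic_depth_layer(hidden, router)
print(ran, skipped_out == hidden)
```

When the block is skipped its computation is never performed, which is where the efficiency gain of a dynamic-depth model would come from; the frozen block's weights are never modified in this sketch.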

Code & Models

Repositories

case-lab-umd/router-tuning (PyTorch, official)

Taxonomy

Topics: Modular Robots and Swarm Intelligence