DORA: Dynamic Online Reinforcement Agent for Token Merging in Vision Transformers
Kaixuan He, Song Chen, Yi Kang

TL;DR
DORA is a reinforcement learning-based online framework for dynamic token merging in Vision Transformers that improves inference efficiency while maintaining accuracy across model scales and datasets.
Contribution
The authors present DORA as the first RL-driven online inference method for adaptive token merging in ViTs. Merging is formulated as a sequential MDP, and the agent is optimized with an asymmetric Actor-Critic architecture.
Findings
Reaches a token merging rate of up to 12.66% with negligible accuracy loss.
Achieves a 569.7% relative efficiency improvement over the baseline.
Outperforms state-of-the-art methods in computational savings on ImageNet and OOD benchmarks.
Abstract
Vision Transformers (ViTs) incur significant computational overhead due to the quadratic complexity of self-attention relative to the token sequence length. While existing token reduction methods mitigate this issue, they predominantly rely on fixed heuristic metrics, predefined ratios, or static offline masks, which lack the adaptability to capture input-dependent redundancy during inference. In this paper, we propose DORA (Dynamic Online Reinforcement Agent), the first reinforcement learning (RL)-driven online inference framework for dynamic token merging in ViTs. We formulate the merging process as a sequential Markov Decision Process (MDP), where a lightweight RL agent determines the merging strategy for each Transformer block based on the current feature state and layer-specific context. To balance computational efficiency and feature fidelity, the agent is optimized via a dense…
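The per-block decision loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `toy_policy`, `toy_block`, `merge_most_similar`, `run_episode`, and `relative_attention_cost` are hypothetical names, the policy is a hand-written heuristic standing in for the learned RL agent, and the merge step is a simple greedy average of the most cosine-similar token pair rather than whatever merging operator the paper uses. The sketch only shows the control flow (state in, merge decision out, one decision per block) and why fewer tokens reduce the quadratic attention cost.

```python
import numpy as np


def merge_most_similar(tokens: np.ndarray, r: int) -> np.ndarray:
    """Greedily merge r token pairs, averaging the most cosine-similar pair each time.

    Assumption: plain unweighted averaging, chosen for brevity.
    """
    tokens = tokens.copy()
    for _ in range(r):
        n = tokens.shape[0]
        if n < 2:
            break
        unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
        sim = unit @ unit.T
        np.fill_diagonal(sim, -np.inf)  # never pair a token with itself
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        merged = (tokens[i] + tokens[j]) / 2.0
        rest = np.delete(tokens, [i, j], axis=0)
        tokens = np.vstack([rest, merged[None, :]])
    return tokens


def toy_policy(tokens: np.ndarray, layer_idx: int, num_layers: int, max_r: int = 8) -> int:
    """Hypothetical stand-in for the learned agent: maps (features, layer) to a merge count.

    The state here is the mean pairwise cosine similarity (a crude redundancy
    score) scaled by relative depth. A trained actor network would replace this.
    """
    n = tokens.shape[0]
    if n < 2:
        return 0
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = unit @ unit.T
    mean_off_diag = (sim.sum() - np.trace(sim)) / (n * (n - 1))
    redundancy = float(np.clip(mean_off_diag, 0.0, 1.0))
    depth = (layer_idx + 1) / num_layers
    return min(int(round(max_r * redundancy * depth)), n - 1)


def toy_block(tokens: np.ndarray) -> np.ndarray:
    """Placeholder for a Transformer block (attention + MLP); an assumption for the demo."""
    return np.tanh(tokens)


def run_episode(tokens: np.ndarray, num_layers: int = 4):
    """One inference pass: the policy picks a merge count before every block."""
    counts = [tokens.shape[0]]
    for layer_idx in range(num_layers):
        r = toy_policy(tokens, layer_idx, num_layers)
        tokens = merge_most_similar(tokens, r)
        tokens = toy_block(tokens)
        counts.append(tokens.shape[0])
    return tokens, counts


def relative_attention_cost(counts) -> float:
    """Attention-score cost relative to no merging, using the N^2 scaling of self-attention."""
    n0 = counts[0]
    per_layer = counts[1:]  # token count seen by each block after merging
    return sum(n * n for n in per_layer) / (len(per_layer) * n0 * n0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(16, 8)) + 2.0  # shared offset makes tokens partly redundant
    out, counts = run_episode(x)
    print("tokens per layer:", counts)
    print("relative attention cost:", relative_attention_cost(counts))
```

Because the attention cost is summed as the square of the token count at each block, merging early in the network compounds across all later blocks. That is the trade-off an online agent has to weigh against the loss of feature fidelity.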
