DORA: Dynamic Online Reinforcement Agent for Token Merging in Vision Transformers

Kaixuan He; Song Chen; Yi Kang

arXiv:2605.11683·cs.CV·May 13, 2026

TL;DR

DORA introduces a reinforcement learning-based online framework for dynamic token merging in Vision Transformers, enhancing efficiency while maintaining accuracy across various scales and datasets.

Contribution

DORA is presented as the first RL-driven online inference method for adaptive token merging in ViTs. It formulates merging as a sequential MDP and trains the agent with an asymmetric Actor-Critic architecture.

Findings

01. Up to 12.66% token merging rate with negligible accuracy loss.

02. Achieves 569.7% relative efficiency improvement over baseline.

03. Outperforms state-of-the-art methods in computational savings on ImageNet and OOD benchmarks.

Abstract

Vision Transformers (ViTs) incur significant computational overhead due to the quadratic complexity of self-attention relative to the token sequence length. While existing token reduction methods mitigate this issue, they predominantly rely on fixed heuristic metrics, predefined ratios, or static offline masks, which lack the adaptability to capture input-dependent redundancy during inference. In this paper, we propose DORA (Dynamic Online Reinforcement Agent), the first reinforcement learning (RL)-driven online inference framework for dynamic token merging in ViTs. We formulate the merging process as a sequential Markov Decision Process (MDP), where a lightweight RL agent determines the merging strategy for each Transformer block based on the current feature state and layer-specific context. To balance computational efficiency and feature fidelity, the agent is optimized via a dense…
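The per-block decision loop described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the paper's design: the state (mean token feature plus a normalized layer index), the discrete action set `ACTIONS`, the untrained linear `policy`, and the similarity-based `merge_tokens` operator (a simplified bipartite matching in the style of earlier token-merging work) are all hypothetical stand-ins. The reward and the Actor-Critic training are omitted because the abstract is truncated before describing them.

```python
import numpy as np

# Hypothetical action set: how many tokens to merge in a block (assumption).
ACTIONS = (0, 2, 4, 8)


def merge_tokens(x, r):
    """Merge r tokens by averaging each chosen token into its most similar partner.

    Simplified bipartite matching; an illustrative operator, not DORA's.
    x has shape (num_tokens, dim). Token order is not preserved.
    """
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    r = max(0, min(r, n // 2))  # cannot merge more than half the tokens
    if r == 0:
        return x
    a, b = x[0::2], x[1::2]  # split tokens into two alternating sets
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    sim = a_n @ b_n.T  # cosine similarity between the two sets
    best_dst = sim.argmax(axis=1)
    order = np.argsort(-sim.max(axis=1))  # most similar sources first
    merged_src, kept_src = order[:r], np.sort(order[r:])
    sums, counts = b.copy(), np.ones(b.shape[0])
    for i in merged_src:  # accumulate merged sources into their destinations
        sums[best_dst[i]] += a[i]
        counts[best_dst[i]] += 1
    return np.concatenate([a[kept_src], sums / counts[:, None]], axis=0)


def policy(state, w):
    """Hypothetical greedy linear policy: state -> one entry of ACTIONS."""
    return ACTIONS[int(np.argmax(state @ w))]


def run_episode(x, num_blocks, w):
    """One pass over the blocks: observe state, pick r, merge, move on."""
    trace = []
    for layer in range(num_blocks):
        state = np.concatenate([x.mean(axis=0), [layer / num_blocks]])
        r = policy(state, w)
        before = x.shape[0]
        x = merge_tokens(x, r)
        # A real Transformer block (attention + MLP) would run here.
        trace.append((layer, before, x.shape[0]))
    return x, trace
```

Because the action is chosen from the current token features at every block, the number of tokens removed varies per input and per layer, which is the contrast with fixed ratios and static offline masks that the abstract draws. In this sketch the weights `w` would have shape `(dim + 1, len(ACTIONS))`.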
