ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning

Huanxuan Liao; Zhongtao Jiang; Yupu Hao; Yuqiao Tan; Shizhu He; Ben Wang; Jun Zhao; Kun Xu; Kang Liu

arXiv:2603.28610 · cs.CV · April 1, 2026

TL;DR

ResAdapt is an input-side adaptation framework for multimodal models that decides, before encoding, how much visual budget each frame receives, so the model can reason over far less visual input without sacrificing accuracy.

Contribution

The paper proposes a lightweight Allocator, trained with Cost-Aware Policy Optimization (CAPO), that adapts the visual input before it reaches an unchanged MLLM backbone, improving the efficiency of multimodal reasoning.

Findings

01. Supports up to 16x more frames at the same visual budget.

02. Achieves over 15% performance gain on reasoning benchmarks.

03. Operates near the efficiency-accuracy frontier across tasks.
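
The trade between per-frame resolution and frame count in the first finding can be illustrated with simple arithmetic. The sketch below is not taken from the paper: it assumes a patch-based encoder whose token count scales with pixel count, and the patch size (14), frame size (448x448), and budget (16384 tokens) are hypothetical values chosen only to make the numbers round. Under those assumptions, resizing each side to one quarter cuts tokens per frame by 16x, which is one plausible way a fixed budget could cover 16x more frames.

```python
def tokens_per_frame(height, width, patch=14):
    # Assumption: one visual token per non-overlapping patch x patch tile.
    return (height // patch) * (width // patch)


def frames_within_budget(budget_tokens, height, width, scale, patch=14):
    # Resize each side by `scale` (hypothetical operator), keeping at least one patch.
    h = max(patch, int(height * scale))
    w = max(patch, int(width * scale))
    return budget_tokens // tokens_per_frame(h, w, patch)


if __name__ == "__main__":
    budget = 16384  # hypothetical visual-token budget
    for scale in (1.0, 0.5, 0.25):
        n = frames_within_budget(budget, 448, 448, scale)
        print(f"scale={scale}: {n} frames fit in the budget")
```

With these illustrative numbers, a full-resolution frame costs 1024 tokens and a quarter-scale frame costs 64, so the same budget holds 16 frames or 256 frames respectively.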

Abstract

Multimodal Large Language Models (MLLMs) achieve stronger visual understanding by scaling input fidelity, yet the resulting visual token growth makes jointly sustaining high spatial resolution and long temporal context prohibitive. We argue that the bottleneck lies not in how post-encoding representations are compressed but in the volume of pixels the encoder receives, and address it with ResAdapt, an Input-side adaptation framework that learns how much visual budget each frame should receive before encoding. ResAdapt couples a lightweight Allocator with an unchanged MLLM backbone, so the backbone retains its native visual-token interface while receiving an operator-transformed input. We formulate allocation as a contextual bandit and train the Allocator with Cost-Aware Policy Optimization (CAPO), which converts sparse rollout feedback into a stable accuracy-cost learning signal. Across…
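
To make the contextual-bandit framing concrete, here is a toy sketch of learning a per-frame resolution policy from an accuracy-minus-cost reward. It is not the authors' Allocator or CAPO: it uses plain REINFORCE with a running-mean baseline, a one-dimensional "detail" feature as the context, and a simulated correctness signal in place of the MLLM backbone. All names (`SCALES`, `LAMBDA`, `reward`, `train`) and the reward shape are assumptions made for illustration.

```python
import math
import random

SCALES = [0.25, 0.5, 1.0]  # hypothetical per-side resize factors (the bandit's arms)
LAMBDA = 0.5               # hypothetical weight on the visual cost


def cost(scale):
    # Fraction of pixels kept after resizing both sides by `scale`.
    return scale * scale


def reward(detail, scale, lam=LAMBDA):
    # Simulated stand-in for backbone feedback: the answer counts as correct
    # only if the chosen scale preserves enough detail for this frame.
    accuracy = 1.0 if scale >= detail else 0.0
    return accuracy - lam * cost(scale)


def policy_probs(params, detail):
    # Softmax over linear logits w * detail + b, one (w, b) pair per arm.
    logits = [w * detail + b for w, b in params]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def expected_scale(params, detail):
    return sum(p * s for p, s in zip(policy_probs(params, detail), SCALES))


def train(steps=20000, lr=0.1, seed=0):
    rng = random.Random(seed)
    params = [[0.0, 0.0] for _ in SCALES]
    baseline = 0.0  # running mean of rewards, used to reduce variance
    for _ in range(steps):
        detail = rng.random()                      # context: one frame's detail level
        probs = policy_probs(params, detail)
        action = rng.choices(range(len(SCALES)), weights=probs)[0]
        r = reward(detail, SCALES[action])
        advantage = r - baseline
        baseline += 0.01 * (r - baseline)
        for k in range(len(SCALES)):
            # Gradient of log pi(action) with respect to logit k.
            grad_logit = (1.0 if k == action else 0.0) - probs[k]
            params[k][0] += lr * advantage * grad_logit * detail
            params[k][1] += lr * advantage * grad_logit
    return params


if __name__ == "__main__":
    learned = train()
    for d in (0.05, 0.4, 0.95):
        print(f"detail={d}: expected scale {expected_scale(learned, d):.2f}")
```

In this toy setting the learned policy tends to spend more resolution on high-detail frames and less on low-detail ones, which is the qualitative behavior an accuracy-cost signal is meant to encourage. The real system would replace the simulated `reward` with rollout feedback from the backbone.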

Code & Models

Repositories

Xnhyacinth/ResAdapt (GitHub)
