TL;DR
ResAdapt is an input-side adaptation framework for multimodal models that dynamically allocates visual budget to each frame before encoding, enabling efficient reasoning with significantly reduced visual input without sacrificing accuracy.
Contribution
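The paper proposes an input-side adaptation method in which a lightweight Allocator, trained with Cost-Aware Policy Optimization (CAPO), decides how much visual budget each frame receives, improving efficiency in multimodal reasoning tasks while leaving the MLLM backbone unchanged.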
Findings
Supports up to 16x more frames at the same visual budget.
Achieves over 15% performance gain on reasoning benchmarks.
Operates near the efficiency-accuracy frontier across tasks.
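One way to read the 16x figure is as simple budget arithmetic, assuming the visual token count scales with pixel area. The sketch below is illustrative only: the patch size of 14, the two resolutions, and the helper names are hypothetical and are not taken from the paper, which may reach its figure differently.

```python
def tokens_per_frame(height: int, width: int, patch: int = 14) -> int:
    """Tokens for one frame under a hypothetical ViT-style tokenizer
    that emits one token per patch x patch block."""
    return (height // patch) * (width // patch)


def frames_within_budget(budget_tokens: int, height: int, width: int,
                         patch: int = 14) -> int:
    """How many frames of the given size fit in a fixed token budget."""
    return budget_tokens // tokens_per_frame(height, width, patch)


# Assumed budget: 8 frames at an assumed full resolution of 448x448.
full_tokens = tokens_per_frame(448, 448)
budget = 8 * full_tokens

frames_full = frames_within_budget(budget, 448, 448)
# Shrinking each side by 4x cuts the pixel area (and tokens) by 16x.
frames_small = frames_within_budget(budget, 112, 112)

print(frames_full, frames_small, frames_small // frames_full)
```

Under these assumptions, reducing each side of a frame by 4x frees enough budget for 16 times as many frames. A learned allocator would presumably mix resolutions across frames rather than shrink all of them uniformly.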
Abstract
Multimodal Large Language Models (MLLMs) achieve stronger visual understanding by scaling input fidelity, yet the resulting visual token growth makes jointly sustaining high spatial resolution and long temporal context prohibitive. We argue that the bottleneck lies not in how post-encoding representations are compressed but in the volume of pixels the encoder receives, and address it with ResAdapt, an input-side adaptation framework that learns how much visual budget each frame should receive before encoding. ResAdapt couples a lightweight Allocator with an unchanged MLLM backbone, so the backbone retains its native visual-token interface while receiving an operator-transformed input. We formulate allocation as a contextual bandit and train the Allocator with Cost-Aware Policy Optimization (CAPO), which converts sparse rollout feedback into a stable accuracy-cost learning signal. Across…
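The contextual-bandit framing in the abstract can be illustrated with a toy policy-gradient loop. This is a minimal sketch and not CAPO: it uses plain REINFORCE with a mean baseline, a linear-softmax allocator, and a reward of task score minus a weighted pixel-area cost. The scale set, the feature shapes, `reward_fn`, and the weight `lam` are all hypothetical stand-ins.

```python
import numpy as np

SCALES = np.array([0.25, 0.5, 1.0])  # hypothetical per-frame resize factors


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def cost(actions):
    """Mean pixel area relative to full resolution (assumed cost model)."""
    return float(np.mean(SCALES[actions] ** 2))


def reinforce_step(W, contexts, reward_fn, rng, lam=0.5, lr=0.5, n_rollouts=8):
    """One REINFORCE update for a linear-softmax allocator.

    contexts: (n_frames, d) per-frame features; W: (d, n_scales) weights.
    reward_fn(actions) stands in for the frozen backbone's task score.
    """
    probs = softmax(contexts @ W)  # (n_frames, n_scales)
    rollouts, rewards = [], []
    for _ in range(n_rollouts):
        # Sample one scale index per frame from the current policy.
        a = np.array([rng.choice(len(SCALES), p=p) for p in probs])
        rollouts.append(a)
        rewards.append(reward_fn(a) - lam * cost(a))  # accuracy minus cost
    rewards = np.array(rewards)
    advantages = rewards - rewards.mean()  # mean baseline reduces variance
    grad = np.zeros_like(W)
    for a, adv in zip(rollouts, advantages):
        onehot = np.eye(len(SCALES))[a]
        # Gradient of log pi(a | x) for a linear-softmax policy.
        grad += adv * contexts.T @ (onehot - probs)
    return W + lr * grad / n_rollouts, float(rewards.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    contexts = np.eye(4)  # toy setup: 4 frames with one-hot features
    W = np.zeros((4, len(SCALES)))

    def toy_reward(a):
        # Toy task: only frame 0 matters, and only at full resolution.
        return 1.0 if a[0] == 2 else 0.0

    for _ in range(200):
        W, mean_reward = reinforce_step(W, contexts, toy_reward, rng)
    print(np.round(softmax(contexts @ W), 2), mean_reward)
```

In this toy setup the reward only depends on frame 0, so the cost term is the only signal acting on the other frames. The sketch shows the shape of an accuracy-cost trade-off but makes no claim about how CAPO stabilizes the signal.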
