TL;DR
This paper introduces AVR, an adaptive visual reasoning framework that cuts unnecessary reasoning steps in visual reasoning models (VRMs) by letting the model dynamically select a response format, substantially reducing token usage while maintaining accuracy.
Contribution
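It proposes AVR, an adaptive reasoning framework in which the model chooses among three response formats, together with FS-GRPO, a training method adapted from Group Relative Policy Optimization. The combination improves the efficiency of visual reasoning models compared to prior static methods.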
Findings
AVR reduces token usage by 50-90% on benchmarks.
AVR maintains accuracy while decreasing reasoning complexity.
Adaptive reasoning mitigates overthinking in VRMs.
Abstract
Visual reasoning models (VRMs) have recently shown strong cross-modal reasoning capabilities by integrating visual perception with language reasoning. However, they often suffer from overthinking, producing unnecessarily long reasoning chains for all tasks. We attribute this issue to Reasoning Path Redundancy in visual reasoning: many visual questions do not require the full reasoning process. To address this, we propose AVR, an adaptive visual reasoning framework that decomposes visual reasoning into three cognitive functions: visual perception, logical reasoning, and answer application. It further enables models to dynamically choose among three response formats: Full Format, Perception-Only Format, and Direct Answer. AVR is trained with FS-GRPO, an adaptation of Group Relative Policy Optimization that encourages the model to select the most efficient reasoning…
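To make the abstract's setup concrete, here is a minimal sketch in Python. It is not the authors' implementation. The mapping from response formats to cognitive functions, the `toy_reward` function, and its `budget` and `length_weight` parameters are all assumptions made for illustration; the abstract as shown here is truncated and does not specify FS-GRPO's reward. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows standard Group Relative Policy Optimization.

```python
from enum import Enum
from statistics import mean, pstdev


class ResponseFormat(Enum):
    # ASSUMPTION: which cognitive functions each format includes is
    # inferred from the format names, not stated in the abstract.
    FULL = ("perception", "reasoning", "answer")
    PERCEPTION_ONLY = ("perception", "answer")
    DIRECT_ANSWER = ("answer",)


def toy_reward(correct: bool, num_tokens: int,
               budget: int = 1000, length_weight: float = 0.5) -> float:
    """Hypothetical reward: correctness minus a length penalty.

    This is NOT FS-GRPO's reward; it only illustrates how a shorter
    correct response could be preferred over a longer correct one.
    """
    if not correct:
        return 0.0
    return 1.0 - length_weight * min(num_tokens / budget, 1.0)


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantages: normalize rewards within one group
    of responses sampled for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of sampled responses to the same (easy) visual question:
# (format, answered correctly, tokens used). Values are made up.
group = [
    (ResponseFormat.FULL, True, 800),
    (ResponseFormat.PERCEPTION_ONLY, True, 300),
    (ResponseFormat.DIRECT_ANSWER, True, 20),
]
rewards = [toy_reward(ok, n) for _, ok, n in group]
advantages = group_relative_advantages(rewards)
```

Under these assumed rewards, the Direct Answer response receives the highest advantage in the group because it is correct and shortest, which is the kind of pressure toward efficient formats that the abstract describes.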
