UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning
Yifan Wang, Yun Fu

TL;DR
UnAC enhances multimodal reasoning in large multimodal models through adaptive visual prompts, image abstraction, and stepwise self-checking, improving accuracy on complex tasks.
Contribution
The paper introduces UnAC, a novel multimodal prompting framework combining adaptive visual focus, image abstraction, and self-verification for better reasoning.
Findings
Improved performance on MathVista, MM-Vet, and MMMU benchmarks.
Effective focus on salient image regions enhances understanding.
Self-checking scheme reduces reasoning errors.
Abstract
Although recent large multimodal models (LMMs) have become much stronger at visual perception, they remain unreliable on problems that require multi-step reasoning over visual evidence. In this paper, we present UnAC (Understanding, Abstracting, and Checking), a multimodal prompting method that strengthens reasoning for complex multimodal tasks in LMMs (e.g., GPT-4o, Gemini 1.5, and GPT-4V). To improve image understanding and capture fine details, we propose an adaptive visual prompting strategy that enables LMMs to focus on salient regions. We further design an image-abstraction prompt to effectively extract key information from images. In addition, we introduce a gradual self-checking scheme that improves reasoning by verifying each decomposed subquestion and its answer. Extensive experiments on three public benchmarks (MathVista, MM-Vet, and MMMU) show that UnAC improves performance.
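The three stages named in the abstract (understanding, abstracting, checking) can be pictured as a chain of prompts. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the `LMM` callable interface, the function name `unac_sketch`, the prompt wording, the YES/NO check format, and the `max_retries` parameter are all hypothetical and do not come from the paper.

```python
from typing import Callable, List, Tuple

# Hypothetical interface (an assumption): any function that maps
# (prompt, image bytes) to a text reply can stand in for the model.
LMM = Callable[[str, bytes], str]


def unac_sketch(
    question: str, image: bytes, lmm: LMM, max_retries: int = 1
) -> Tuple[str, List[Tuple[str, str]]]:
    """Illustrative understand -> abstract -> check loop (prompts are invented)."""
    # 1. Understanding: ask the model which image regions matter for the question.
    regions = lmm(f"REGIONS: list the image regions most relevant to '{question}'", image)

    # 2. Abstracting: condense those regions into the key facts needed.
    facts = lmm(f"ABSTRACT: using regions [{regions}], summarize the key facts", image)

    # 3. Decompose the question into subquestions, one per line.
    plan = lmm(f"DECOMPOSE: given facts [{facts}], split '{question}' into subquestions", image)
    subquestions = [line.strip() for line in plan.splitlines() if line.strip()]

    # 4. Checking: answer each subquestion, then verify it before moving on.
    verified: List[Tuple[str, str]] = []
    for sub in subquestions:
        history = "; ".join(f"{q} = {a}" for q, a in verified)
        answer = lmm(f"ANSWER: facts [{facts}], verified [{history}], question '{sub}'", image)
        for _ in range(max_retries):
            verdict = lmm(f"CHECK: is '{answer}' correct for '{sub}'? Reply YES or NO", image)
            if verdict.strip().upper().startswith("YES"):
                break
            # The check failed: request a revised answer. After the last retry
            # the revision is kept without a further check.
            answer = lmm(f"REVISE: '{answer}' failed the check, answer '{sub}' again", image)
        verified.append((sub, answer))

    # 5. Combine the verified sub-answers into a final answer.
    history = "; ".join(f"{q} = {a}" for q, a in verified)
    final = lmm(f"FINAL: using verified steps [{history}], answer '{question}'", image)
    return final, verified
```

Passing the model in as a plain callable keeps the sketch independent of any particular vendor API, so it can be exercised with a stub function before wiring in a real model client.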
