UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning

Yifan Wang; Yun Fu

arXiv:2605.03950·cs.CV·May 6, 2026

TL;DR

UnAC strengthens multimodal reasoning in large multimodal models (LMMs) through adaptive visual prompts, image abstraction, and stepwise self-checking, improving accuracy on complex tasks.

Contribution

The paper introduces UnAC, a multimodal prompting framework that combines adaptive visual prompting, image abstraction, and stepwise self-checking to improve multi-step reasoning.

Findings

01 Improved performance on MathVista, MM-Vet, and MMMU benchmarks.

02 Effective focus on salient image regions enhances understanding.

03 Self-checking scheme reduces reasoning errors.

Abstract

Although recent large multimodal models (LMMs) have become much stronger at visual perception, they remain unreliable on problems that require multi-step reasoning over visual evidence. In this paper, we present UnAC (Understanding, Abstracting, and Checking), a multimodal prompting method that strengthens reasoning for complex multimodal tasks in LMMs (e.g., GPT-4o, Gemini 1.5, and GPT-4V). To improve image understanding and capture fine details, we propose an adaptive visual prompting strategy that enables LMMs to focus on salient regions. We further design an image-abstraction prompt to effectively extract key information from images. In addition, we introduce a gradual self-checking scheme that improves reasoning by verifying each decomposed subquestion and its answer. Extensive experiments on three public benchmarks (MathVista, MM-Vet, and MMMU) show improved performance.
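The three stages named in the abstract can be pictured as a simple prompting loop. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the `LMM` callable interface, the `unac_sketch` and `CheckedStep` names, the prompt wording, and the retry policy are all hypothetical, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interface (an assumption, not from the paper):
# (prompt text, image reference) -> model reply text.
LMM = Callable[[str, str], str]


@dataclass
class CheckedStep:
    subquestion: str
    answer: str
    verified: bool


def unac_sketch(model: LMM, image: str, question: str,
                max_retries: int = 1) -> Tuple[str, List[CheckedStep]]:
    """Illustrative understand -> abstract -> check loop with made-up prompts."""
    # Understanding: ask the model which regions of the image to focus on.
    focus = model(f"FOCUS: Which image regions matter most for: {question}", image)

    # Abstracting: condense the image into the key facts needed for reasoning.
    abstraction = model(
        f"ABSTRACT: Using regions [{focus}], summarise the key facts in the image.",
        image,
    )

    # Decompose the question into subquestions, one per non-empty line.
    plan = model(f"DECOMPOSE: Split into subquestions, one per line: {question}", image)
    subquestions = [line.strip() for line in plan.splitlines() if line.strip()]

    # Checking: answer each subquestion, then ask the model to verify the answer.
    steps: List[CheckedStep] = []
    for sub in subquestions:
        known = "; ".join(f"{s.subquestion} -> {s.answer}" for s in steps)
        answer, verified = "", False
        for _ in range(max_retries + 1):
            answer = model(
                f"ANSWER: Facts: {abstraction}. Known: {known}. Subquestion: {sub}",
                image,
            )
            verdict = model(
                f"CHECK: Is '{answer}' correct for '{sub}' given: {abstraction}? "
                "Reply yes or no.",
                image,
            )
            verified = verdict.strip().lower().startswith("yes")
            if verified:
                break  # stop retrying once the step passes its check
        steps.append(CheckedStep(sub, answer, verified))

    # Combine the checked steps into a final answer.
    summary = "; ".join(f"{s.subquestion} -> {s.answer}" for s in steps)
    final = model(f"FINAL: Given checked steps [{summary}], answer: {question}", image)
    return final, steps
```

Passing the model as a plain callable keeps the sketch independent of any particular vendor API; a real client for one of the LMMs mentioned above would be wrapped in a function with the same two-argument shape.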
