ConFoThinking: Consolidated Focused Attention Driven Thinking for Visual Question Answering
Zhaodong Wu, Haochen Xue, Qi Cao, Wenqi Mo, Yu Pei, Wenqi Xu, Jionglong Su, Yang Liu

TL;DR
ConFoThinking is an attention aggregation framework for visual question answering. It consolidates where-to-look attention that is otherwise fragmented across layers and uses it to localize salient regions, reducing noise and improving accuracy on five VQA benchmarks.
Contribution
It proposes a new attention aggregation method that consolidates multi-layer attention signals and uses semantic cues for better localization in VQA tasks.
Findings
Significant performance improvements on five VQA benchmarks.
Effective aggregation of attention signals enhances localization accuracy.
Reduction of semantic noise improves reasoning in VQA models.
Abstract
Thinking with Images improves fine-grained VQA for MLLMs by emphasizing visual cues. However, tool-augmented methods depend on grounding capability, which remains unreliable for MLLMs. In parallel, attention-driven methods that crop Regions of Interest (ROIs) have been proposed, but they are constrained by (1) fragmented attention signals scattered across layers, leading to suboptimal localization, and (2) reliance on question- or redundant-text-conditioned attention extraction. Our analysis reveals three patterns: MLLMs may attend to the correct region yet generate incorrect coordinates, where-to-look attention is often fragmented across layers, and attention extraction is query-sensitive. Motivated by these observations, we propose ConFoThinking, a Consolidated-Focused-Attention-Driven Thinking framework that learns to aggregate attention into a designated intermediate layer, from which we mine and…
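To make the generic idea of attention-driven ROI cropping concrete, here is a minimal sketch. It is not the ConFoThinking method: the abstract says the framework learns to aggregate attention into a designated intermediate layer, whereas this sketch substitutes a plain average over per-layer attention maps. The function names (`aggregate_attention`, `attention_to_box`, `crop_roi`) and the `keep_ratio` threshold are hypothetical and chosen only for illustration.

```python
import numpy as np


def aggregate_attention(layer_maps):
    """Average per-layer attention maps (each H x W) into a single map.

    Assumption: a simple mean stands in for the learned aggregation
    described in the abstract.
    """
    stacked = np.stack(layer_maps, axis=0).astype(float)
    # Normalize each layer to sum to 1 so no single layer dominates.
    sums = stacked.sum(axis=(1, 2), keepdims=True)
    stacked = stacked / np.where(sums > 0, sums, 1.0)
    return stacked.mean(axis=0)


def attention_to_box(attn, keep_ratio=0.5):
    """Bounding box (row_min, col_min, row_max, col_max), inclusive, around
    all cells whose attention is at least keep_ratio * max attention."""
    mask = attn >= keep_ratio * attn.max()
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    return int(rows[0]), int(cols[0]), int(rows[-1]), int(cols[-1])


def crop_roi(image, box, grid_shape):
    """Scale a box from attention-grid coordinates to pixels and crop."""
    h, w = image.shape[:2]
    gh, gw = grid_shape
    r0, c0, r1, c1 = box
    y0, y1 = r0 * h // gh, (r1 + 1) * h // gh
    x0, x1 = c0 * w // gw, (c1 + 1) * w // gw
    return image[y0:y1, x0:x1]
```

In this toy setup, two layers that each highlight a different part of the same object yield a box covering both parts only after aggregation, which illustrates why attention fragmented across layers can give suboptimal localization when a single layer is used alone.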
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
