Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering
Songtao Jiang, Chenyi Zhou, Yan Zhang, Yeying Jin, Zuozhu Liu

TL;DR
This paper introduces FOCUS, a dynamic approach inspired by Dual Process Theory that improves visual question answering by selectively emphasizing key visual elements based on question complexity, leading to consistent performance gains.
Contribution
The paper presents FOCUS, a novel plug-and-play method that adaptively combines intuitive and analytical reasoning to enhance multimodal large language models (MLLMs) on visual question answering (VQA) tasks.
Findings
FOCUS improves performance across four benchmarks.
Selective visual prompting outperforms indiscriminate annotation.
Combining cognitive strategies yields significant accuracy gains.
Abstract
Multimodal large language models (MLLMs) still struggle with complex reasoning tasks in Visual Question Answering (VQA). While current methods have advanced by incorporating visual prompts, our study uncovers critical limitations: these approaches indiscriminately annotate all detected objects for every visual question, generating excessive visual markers that degrade task performance. This issue stems primarily from a lack of focus on key visual elements, raising two important questions: Are all objects equally important, and do all questions require visual prompts? Motivated by Dual Process Theory, which distinguishes between instinctive and deliberate cognitive modes in human reasoning, we propose FOCUS, a plug-and-play approach that dynamically adapts to the complexity of questions, combining fast intuitive judgments with deliberate analytical reasoning to enhance the…
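The abstract describes routing by question complexity but does not spell out the rule used. As a purely illustrative sketch, the idea of "answer simple questions directly, and mark only the relevant objects for complex ones" might look like the following. Every name here (`estimate_complexity`, `select_key_objects`, `route`, the cue list, and the 0.5 threshold) is a hypothetical stand-in and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class DetectedObject:
    label: str
    box: tuple  # (x1, y1, x2, y2) in pixels


def estimate_complexity(question: str) -> float:
    """Toy heuristic (assumption): count relational or counting cues in the question."""
    cues = ("how many", "left of", "right of", "between", "behind", "next to")
    q = question.lower()
    hits = sum(cue in q for cue in cues)
    return min(1.0, hits / 2)


def select_key_objects(question: str, objects: list) -> list:
    """Keep only detected objects whose label is mentioned in the question (assumption)."""
    words = set(question.lower().replace("?", "").split())
    return [o for o in objects if o.label.lower() in words]


def route(question: str, objects: list, threshold: float = 0.5) -> dict:
    """Fast path: no visual prompts. Slow path: annotate only the key objects."""
    if estimate_complexity(question) < threshold:
        return {"mode": "fast", "annotate": []}
    return {"mode": "slow", "annotate": select_key_objects(question, objects)}
```

In this sketch, a question with no relational cue takes the fast path with no markers, while a question such as "Is the dog left of the cat?" takes the slow path and marks only the dog and the cat rather than every detected object.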
Taxonomy
Topics: Multimodal Machine Learning Applications · Visual and Cognitive Learning Processes · Intelligent Tutoring Systems and Adaptive Learning
Methods: FOCUS
