See the Forest and the Trees: A Synergistic Reasoning Framework for Knowledge-Based Visual Question Answering

Junjie Wang; Yunhan Tang; Yijie Wang; Zhihao Yuan; Huan Wang; Yangfan He; Bin Li

arXiv:2507.17659 · cs.CV · August 14, 2025

TL;DR

Synergos-VQA introduces a multi-evidence reasoning framework for KBVQA that fuses holistic, structural, and causal evidence streams, leading to state-of-the-art results and improved robustness in visual question answering.

Contribution

The paper presents a synergistic reasoning framework that concurrently generates and fuses multiple evidence streams at inference time, so that reasoning no longer relies on a single, uni-dimensional kind of evidence.

Findings

1. Achieves new state-of-the-art on OK-VQA and A-OKVQA benchmarks.
2. Significantly boosts performance of open-source MLLMs.
3. Demonstrates strong plug-and-play capabilities across models.

Abstract

Multimodal Large Language Models (MLLMs) have pushed the frontiers of Knowledge-Based Visual Question Answering (KBVQA), yet their reasoning is fundamentally bottlenecked by a reliance on uni-dimensional evidence. This "seeing only the trees, but not the forest" approach prevents robust, multi-faceted understanding. Inspired by the principle of seeing both the forest and trees, we propose Synergos-VQA, a novel synergistic reasoning framework. At its core, Synergos-VQA concurrently generates and fuses three complementary evidence streams at inference time: (1) Holistic Evidence to perceive the entire scene (the "forest"), (2) Structural Evidence from a prototype-driven module to identify key objects (the "trees"), and (3) Causal Evidence from a counterfactual probe to ensure the reasoning is robustly grounded. By synergistically fusing this multi-faceted evidence, our framework achieves…
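The "concurrently generates and fuses" idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the `Evidence` type, the stubbed evidence text, and the choice to fuse by concatenating evidence into a single prompt are hypothetical and are not taken from the paper, which may fuse evidence in a different way.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Evidence:
    kind: str  # "holistic", "structural", or "causal"
    text: str


# Hypothetical stand-ins for the three evidence generators.
# Each returns canned text so the sketch runs without any model.
def holistic_evidence(image_id: str, question: str) -> Evidence:
    return Evidence("holistic", f"scene-level description of {image_id}")


def structural_evidence(image_id: str, question: str) -> Evidence:
    return Evidence("structural", f"key objects in {image_id} relevant to: {question}")


def causal_evidence(image_id: str, question: str) -> Evidence:
    return Evidence("causal", f"counterfactual check for: {question}")


def gather_evidence(image_id: str, question: str) -> list[Evidence]:
    """Run the three generators concurrently and return results in a fixed order."""
    generators = (holistic_evidence, structural_evidence, causal_evidence)
    with ThreadPoolExecutor(max_workers=len(generators)) as pool:
        futures = [pool.submit(g, image_id, question) for g in generators]
        return [f.result() for f in futures]


def fuse(question: str, evidence: list[Evidence]) -> str:
    """Assumed fusion step: concatenate labeled evidence ahead of the question."""
    lines = [f"[{e.kind}] {e.text}" for e in evidence]
    return "\n".join(lines + [f"Question: {question}"])


if __name__ == "__main__":
    ev = gather_evidence("img_001", "What sport is being played?")
    print(fuse("What sport is being played?", ev))
```

In this sketch the fused string would be handed to an answering model. Keeping each stream labeled lets the downstream model weigh scene-level, object-level, and counterfactual evidence separately.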

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling