Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought

Yunze Man; De-An Huang; Guilin Liu; Shiwei Sheng; Shilong Liu; Liang-Yan Gui; Jan Kautz; Yu-Xiong Wang; Zhiding Yu

arXiv:2505.23766 · cs.CV · May 30, 2025

TL;DR

Argus introduces a visual attention grounding mechanism for multimodal large language models, significantly improving vision-centric reasoning and object grounding by leveraging object-centric visual chain-of-thought signals.

Contribution

The paper presents a novel visual attention grounding method that enhances vision-centric reasoning in multimodal models through explicit language-guided visual region engagement.

Findings

01. Argus outperforms existing models on reasoning and grounding benchmarks.

02. Explicit visual region engagement improves multimodal reasoning accuracy.

03. Design choices are validated through extensive analysis.

Abstract

Recent advances in multimodal large language models (MLLMs) have demonstrated remarkable capabilities in vision-language tasks, yet they often struggle with vision-centric scenarios where precise visual focus is needed for accurate reasoning. In this paper, we introduce Argus to address these limitations with a new visual attention grounding mechanism. Our approach employs object-centric grounding as visual chain-of-thought signals, enabling more effective goal-conditioned visual attention during multimodal reasoning tasks. Evaluations on diverse benchmarks demonstrate that Argus excels in both multimodal reasoning tasks and referring object grounding tasks. Extensive analysis further validates various design choices of Argus, and reveals the effectiveness of explicit language-guided visual region-of-interest engagement in MLLMs, highlighting the importance of advancing multimodal…
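To make "explicit visual region-of-interest engagement" concrete, the following is a minimal sketch of one way a grounded box could be turned into a focused view of the image. It rests on assumptions the abstract does not confirm: that the grounding step yields a box in normalized (x0, y0, x1, y1) coordinates, and that the region is engaged by cropping. The names `box_to_pixels` and `crop_roi` are hypothetical and are not taken from Argus.

```python
from math import ceil
from typing import List, Sequence, Tuple

# Assumption: a grounded box is (x0, y0, x1, y1), normalized to [0, 1].
Box = Tuple[float, float, float, float]


def box_to_pixels(box: Box, width: int, height: int) -> Tuple[int, int, int, int]:
    """Convert a normalized box to integer pixel bounds clamped to the image."""
    x0, y0, x1, y1 = (min(max(v, 0.0), 1.0) for v in box)
    # Order the corners so a box given "backwards" still yields a valid region.
    x0, x1 = sorted((x0, x1))
    y0, y1 = sorted((y0, y1))
    # Keep the top-left corner inside the image.
    left = min(int(x0 * width), width - 1)
    top = min(int(y0 * height), height - 1)
    # Guarantee at least one pixel in each dimension.
    right = max(left + 1, min(width, ceil(x1 * width)))
    bottom = max(top + 1, min(height, ceil(y1 * height)))
    return left, top, right, bottom


def crop_roi(image: Sequence[Sequence[int]], box: Box) -> List[List[int]]:
    """Return the sub-image covered by the box (hypothetical engagement step)."""
    height, width = len(image), len(image[0])
    left, top, right, bottom = box_to_pixels(box, width, height)
    return [list(row[left:right]) for row in image[top:bottom]]


# Toy 4x4 "image" whose pixel value encodes its position (row * 4 + column).
toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
roi = crop_roi(toy_image, (0.5, 0.5, 1.0, 1.0))  # bottom-right quadrant
```

In a real pipeline the cropped region would be passed back to the vision encoder or used to select visual tokens; how Argus does this is not specified in the text above, so the sketch stops at the crop.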

Taxonomy

Methods: Softmax · Attention Is All You Need · Focus