Take A Step Back: Rethinking the Two Stages in Visual Reasoning

Mingyu Zhang; Jiting Cai; Mingyu Liu; Yue Xu; Cewu Lu; Yong-Lu Li

arXiv:2407.19666 · cs.CV · July 30, 2024

Open Access

TL;DR

This paper proposes a two-stage framework for visual reasoning, emphasizing separate symbolization and shared reasoning to improve cross-domain generalization across diverse visual tasks.

Contribution

It introduces a novel two-stage approach with separated encoders for symbolization and a shared reasoner, enhancing generalization in visual reasoning tasks.

Findings

01. Outperforms existing methods on multiple benchmarks

02. Demonstrates strong cross-domain generalization

03. Effective on both 2D and 3D visual tasks

Abstract

Visual reasoning, a prominent research area, plays a crucial role in AI by facilitating concept formation and interaction with the world. However, current work is usually carried out separately on small datasets and thus lacks generalization ability. Through rigorous evaluation on diverse benchmarks, we demonstrate the shortcomings of existing ad-hoc methods in achieving cross-domain reasoning and their tendency to fit data biases. In this paper, we revisit visual reasoning from a two-stage perspective: (1) symbolization and (2) logical reasoning given symbols or their representations. We find that the reasoning stage generalizes better than symbolization. Thus, it is more efficient to implement symbolization via separate encoders for different data domains while using a shared reasoner. Given our findings, we establish design principles for visual reasoning frameworks…
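
The split described in the abstract, one symbolization encoder per data domain feeding a single shared reasoner, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `TwoStageModel`, `make_linear`, and `SYMBOL_DIM`, the domain labels, and the use of random linear maps in place of trained networks are all assumptions made purely for illustration.

```python
import random
from typing import Callable, Dict, List

Vector = List[float]

# Assumed size of the symbol representation shared by all domains.
SYMBOL_DIM = 4


def make_linear(in_dim: int, out_dim: int, seed: int) -> Callable[[Vector], Vector]:
    """Return a fixed random linear map; a stand-in for a trained network."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

    def apply(x: Vector) -> Vector:
        if len(x) != in_dim:
            raise ValueError(f"expected {in_dim} inputs, got {len(x)}")
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return apply


class TwoStageModel:
    """Hypothetical two-stage model: per-domain encoders, one shared reasoner."""

    def __init__(self, domain_input_dims: Dict[str, int], num_answers: int) -> None:
        # Stage 1 (symbolization): a separate encoder for each data domain,
        # each mapping its own input size to the common symbol size.
        self.encoders = {
            name: make_linear(dim, SYMBOL_DIM, seed=i)
            for i, (name, dim) in enumerate(sorted(domain_input_dims.items()))
        }
        # Stage 2 (reasoning): a single reasoner reused by every domain.
        self.reasoner = make_linear(SYMBOL_DIM, num_answers, seed=1000)

    def symbolize(self, domain: str, x: Vector) -> Vector:
        return self.encoders[domain](x)

    def predict(self, domain: str, x: Vector) -> int:
        symbols = self.symbolize(domain, x)
        scores = self.reasoner(symbols)
        # Index of the highest-scoring answer.
        return max(range(len(scores)), key=scores.__getitem__)


# Two made-up domains with different raw input sizes share one reasoner.
model = TwoStageModel({"2d_puzzle": 6, "3d_scene": 10}, num_answers=3)
answer_2d = model.predict("2d_puzzle", [0.1] * 6)
answer_3d = model.predict("3d_scene", [0.2] * 10)
```

In this sketch, supporting a new domain means adding one encoder that emits `SYMBOL_DIM` values; the reasoner is left untouched, which mirrors the abstract's argument that the reasoning stage is the part worth sharing.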

Taxonomy

Topics: Visual and Cognitive Learning Processes · Intelligent Tutoring Systems and Adaptive Learning