From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis

Chuanqi Cheng; Jian Guan; Wei Wu; Rui Yan

arXiv:2406.19934 · cs.CL · October 14, 2024

TL;DR

This paper introduces a data synthesis method for multi-step visual reasoning. A visual reasoner trained on the automatically generated data can be plugged into existing vision-language models to improve them.

Contribution

It proposes a least-to-most visual reasoning paradigm and a cost-effective data synthesis approach to generate multi-step reasoning examples for training visual reasoners.

Findings

1. Significant improvement in VQA benchmarks across multiple models.
2. Reproducible and cost-efficient data synthesis process.
3. Enhanced reasoning abilities in existing vision-language models.

Abstract

We explore multi-step reasoning in vision-language models (VLMs). The problem is challenging, as reasoning data consisting of multiple steps of visual and language processing are barely available. To overcome the challenge, we first introduce a least-to-most visual reasoning paradigm, which interleaves steps of decomposing a question into sub-questions and invoking external tools for resolving sub-questions. Based on the paradigm, we further propose a novel data synthesis approach that can automatically create questions and multi-step reasoning paths for an image in a bottom-up manner. Our approach divides the complex synthesis task into a few simple sub-tasks, and (almost entirely) relies on open-sourced models to accomplish the sub-tasks. Therefore, the entire synthesis process is reproducible and cost-efficient, and the synthesized data is quality guaranteed. With the approach, we…
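The interleaving described in the abstract (decompose the question into a sub-question, invoke an external tool to resolve it, repeat) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Step` record, the `planner` callable, the tool names `locate` and `read_text`, and the `max_steps` cap are all hypothetical stand-ins chosen for illustration, and the tools here are stubs that return canned strings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One reasoning step: a sub-question, the tool used, and its result."""
    sub_question: str
    tool: str
    result: str


# Hypothetical interfaces (assumptions, not taken from the paper):
# a planner proposes the next (sub-question, tool name) given the history,
# or returns None when it considers the question resolved.
Planner = Callable[[str, List[Step]], Optional[Tuple[str, str]]]
# A tool takes an image reference and a sub-question and returns text.
Tool = Callable[[str, str], str]


def least_to_most(image: str, question: str, planner: Planner,
                  tools: Dict[str, Tool], max_steps: int = 8) -> List[Step]:
    """Alternate between decomposing the question and invoking a tool."""
    steps: List[Step] = []
    for _ in range(max_steps):
        proposal = planner(question, steps)
        if proposal is None:  # planner signals that no sub-question remains
            break
        sub_question, tool_name = proposal
        result = tools[tool_name](image, sub_question)
        steps.append(Step(sub_question, tool_name, result))
    return steps


# Toy planner and stub tools so the sketch runs end to end.
def toy_planner(question: str, steps: List[Step]) -> Optional[Tuple[str, str]]:
    script = [("Where is the sign?", "locate"),
              ("What text is on the sign?", "read_text")]
    return script[len(steps)] if len(steps) < len(script) else None


toy_tools: Dict[str, Tool] = {
    "locate": lambda image, q: "region(10, 20, 110, 80)",
    "read_text": lambda image, q: "STOP",
}

trace = least_to_most("street.jpg", "What does the sign say?",
                      toy_planner, toy_tools)
for step in trace:
    print(f"{step.tool}: {step.sub_question} -> {step.result}")
```

In this sketch the planner sees the full history of earlier steps, so each sub-question can build on earlier tool results, which is the sense in which the reasoning proceeds from the least to the most complex step. In a real system the planner would be the trained visual reasoner and the tools would be actual vision models.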

Code & Models

Datasets

orange-sk/VisualReasoner-1M
dataset · 13 dl

Taxonomy

Topics: Data Visualization and Analytics · Semantic Web and Ontologies · Intelligent Tutoring Systems and Adaptive Learning