From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis
Chuanqi Cheng, Jian Guan, Wei Wu, Rui Yan

TL;DR
This paper introduces a novel data synthesis method for multi-step visual reasoning, enabling the enhancement of vision-language models through a plug-and-play visual reasoner trained on automatically generated reasoning data.
Contribution
The paper proposes a least-to-most visual reasoning paradigm and a cost-effective data synthesis approach that generates multi-step reasoning examples for training visual reasoners.
Findings
Plugging the visual reasoner into multiple vision-language models significantly improves their results on VQA benchmarks.
The data synthesis process is reproducible and cost-efficient.
Existing vision-language models gain stronger multi-step reasoning abilities.
Abstract
We explore multi-step reasoning in vision-language models (VLMs). The problem is challenging, as reasoning data consisting of multiple steps of visual and language processing are barely available. To overcome the challenge, we first introduce a least-to-most visual reasoning paradigm, which interleaves steps of decomposing a question into sub-questions and invoking external tools for resolving sub-questions. Based on the paradigm, we further propose a novel data synthesis approach that can automatically create questions and multi-step reasoning paths for an image in a bottom-up manner. Our approach divides the complex synthesis task into a few simple sub-tasks, and (almost entirely) relies on open-sourced models to accomplish the sub-tasks. Therefore, the entire synthesis process is reproducible and cost-efficient, and the synthesized data is quality guaranteed. With the approach, we…
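The interleaving of question decomposition and tool invocation described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `least_to_most`, `Step`, the `decompose` callback, and the toy tools `locate` and `read_text` are hypothetical and are not taken from the paper. In particular, the sketch simply takes the last tool output as the final answer, which may differ from how the authors combine the reasoner with a VLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tool signature: (image reference, sub-question) -> answer text.
Tool = Callable[[str, str], str]
# Hypothetical decomposer: returns the next (sub-question, tool name), or None when done.
Decomposer = Callable[[str, List["Step"]], Optional[Tuple[str, str]]]


@dataclass
class Step:
    sub_question: str
    tool_name: str
    answer: str


def least_to_most(image: str, question: str, decompose: Decomposer,
                  tools: Dict[str, Tool], max_steps: int = 5) -> List[Step]:
    """Alternate between proposing a sub-question and resolving it with a tool."""
    history: List[Step] = []
    for _ in range(max_steps):
        proposal = decompose(question, history)
        if proposal is None:  # the decomposer signals that no sub-question remains
            break
        sub_question, tool_name = proposal
        answer = tools[tool_name](image, sub_question)
        history.append(Step(sub_question, tool_name, answer))
    return history


# Toy stand-ins so the sketch runs without any model (purely illustrative).
def toy_decompose(question: str, history: List[Step]) -> Optional[Tuple[str, str]]:
    plan = [("Where is the sign?", "locate"),
            ("What text is on the sign?", "read_text")]
    return plan[len(history)] if len(history) < len(plan) else None


toy_tools: Dict[str, Tool] = {
    "locate": lambda image, q: "top-left region",
    "read_text": lambda image, q: "STOP",
}

steps = least_to_most("street.jpg", "What does the sign say?", toy_decompose, toy_tools)
final_answer = steps[-1].answer if steps else None
```

In this sketch the decomposer sees the full step history, so each new sub-question can build on earlier answers, which is the "least to most" ordering the paradigm's name suggests.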
Taxonomy
Topics: Data Visualization and Analytics · Semantic Web and Ontologies · Intelligent Tutoring Systems and Adaptive Learning
