Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model

Wenqi Zhang; Zhenglin Cheng; Yuanyu He; Mengna Wang; Yongliang Shen; Zeqi Tan; Guiyang Hou; Mingqian He; Yanna Ma; Weiming Lu; Yueting Zhuang

arXiv:2407.07053 · cs.CV · October 4, 2024

TL;DR

This paper introduces a synthetic benchmark for abstract image understanding and visual reasoning, revealing limitations of current large multimodal models and demonstrating improved performance through fine-tuning with synthetic data.

Contribution

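The paper builds a synthetic multimodal benchmark of 11,193 instructions covering eight abstract-image scenarios (charts, tables, simulated maps, dashboards, flowcharts, relation graphs, floor plans, and visual puzzles), and shows that fine-tuning on synthetic data improves model performance on these tasks.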

Findings

01 Benchmark exposes shortcomings of LMMs in abstract reasoning

02 Fine-tuning with synthetic data improves chart and map understanding

03 Synthetic data benefits other visual reasoning tasks

Abstract

Although most current large multimodal models (LMMs) can already understand photos of natural scenes and portraits, their understanding of abstract images, e.g., charts, maps, or layouts, and their visual reasoning capabilities remain quite rudimentary. They often struggle with simple daily tasks, such as reading the time from a clock, understanding a flowchart, or planning a route using a road map. In light of this, we design a multi-modal self-instruct strategy, utilizing large language models and their code capabilities to synthesize massive abstract images and visual reasoning instructions across daily scenarios. Our strategy effortlessly creates a multimodal benchmark with 11,193 instructions for eight visual scenarios: charts, tables, simulated maps, dashboards, flowcharts, relation graphs, floor plans, and visual puzzles. This benchmark, constructed with simple lines and geometric…
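The idea of synthesizing an abstract image with code and pairing it with an instruction can be illustrated with a small sketch. This is not the authors' pipeline: the function names (`clock_hand_angles`, `polar_point`, `render_clock_svg`, `make_clock_sample`), the SVG layout, and the question wording are all hypothetical, and no language model is called here. It only shows why code-rendered images come with exact ground-truth answers, using the clock-reading task mentioned in the abstract.

```python
import math
import random


def clock_hand_angles(hour, minute):
    """Return (hour_angle, minute_angle) in degrees, clockwise from 12 o'clock."""
    minute_angle = minute * 6.0                      # 360 degrees / 60 minutes
    hour_angle = (hour % 12) * 30.0 + minute * 0.5   # hour hand drifts with the minutes
    return hour_angle, minute_angle


def polar_point(angle_deg, length, cx=100.0, cy=100.0):
    """Point at `length` from the center, `angle_deg` clockwise from straight up."""
    rad = math.radians(angle_deg)
    return cx + length * math.sin(rad), cy - length * math.cos(rad)


def render_clock_svg(hour, minute):
    """Draw a simple clock face (12 ticks, two hands) as an SVG string."""
    hour_angle, minute_angle = clock_hand_angles(hour, minute)
    parts = [
        '<svg xmlns="http://www.w3.org/2000/svg" width="200" height="200">',
        '<circle cx="100" cy="100" r="95" fill="white" stroke="black"/>',
    ]
    for i in range(12):  # one tick mark per hour
        x1, y1 = polar_point(i * 30, 82)
        x2, y2 = polar_point(i * 30, 92)
        parts.append(
            f'<line x1="{x1:.1f}" y1="{y1:.1f}" x2="{x2:.1f}" y2="{y2:.1f}" stroke="black"/>'
        )
    # Short thick hour hand, long thin minute hand.
    for angle, length, width in ((hour_angle, 45, 4), (minute_angle, 70, 2)):
        x, y = polar_point(angle, length)
        parts.append(
            f'<line x1="100" y1="100" x2="{x:.1f}" y2="{y:.1f}" '
            f'stroke="black" stroke-width="{width}"/>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


def make_clock_sample(rng):
    """Build one (image, question, answer) record; the answer is known by construction."""
    hour = rng.randint(1, 12)
    minute = rng.choice(range(0, 60, 5))
    return {
        "image_svg": render_clock_svg(hour, minute),
        "question": "What time is shown on the clock?",
        "answer": f"{hour}:{minute:02d}",
    }


sample = make_clock_sample(random.Random(0))
print(sample["question"], "->", sample["answer"])
```

Because the image is rendered from known parameters, the answer needs no manual annotation; in the paper's setting a language model would propose the scenario, write the drawing code, and author the instructions, which this sketch replaces with fixed templates.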

Peer Reviews

No public reviews on file for this paper yet.

Code & Models

Repositories

zwq2018/multi-modal-self-instruct
Official

Datasets

zwq2018/Multi-modal-Self-instruct
dataset · 547 downloads

Videos

Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model · Underline

Taxonomy

Topics: Multimodal Machine Learning Applications