List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
An Yan, Zhengyuan Yang, Junda Wu, Wanrong Zhu, Jianwei Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Julian McAuley, Jianfeng Gao, Lijuan Wang

TL;DR
This paper introduces a new training paradigm called 'list items one by one' for multimodal large language models, using a curated dataset to improve visual reasoning and object-text alignment by leveraging visual tags.
Contribution
It proposes a novel learning paradigm and dataset to enhance visual grounding in open-source multimodal models, enabling better understanding of visual tags and reasoning.
Findings
Significant improvement in visual reasoning capabilities.
Reduction in hallucinations in multimodal models.
Enhancement of object-text alignment through training with visual tags.
Abstract
Set-of-Mark (SoM) Prompting unleashes the visual grounding capability of GPT-4V, by enabling the model to associate visual objects with tags inserted on the image. These tags, marked with alphanumerics, can be indexed via text tokens for easy reference. Despite the extraordinary performance from GPT-4V, we observe that other Multimodal Large Language Models (MLLMs) struggle to understand these visual tags. To promote the learning of SoM prompting for open-source models, we propose a new learning paradigm: "list items one by one," which asks the model to enumerate and describe all visual tags placed on the image following the alphanumeric orders of tags. By integrating our curated dataset with other visual instruction tuning datasets, we are able to equip existing MLLMs with the SoM prompting ability. Furthermore, we evaluate our finetuned SoM models on five MLLM benchmarks. We find that…
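The "list items one by one" objective described in the abstract can be illustrated with a minimal sketch. It assumes each image comes with a mapping from tag to a short description and builds the text target the model would be asked to produce, with tags in alphanumeric order. The function names, the prompt wording, and the "tag. description" output format are hypothetical choices for illustration, not the authors' actual data format.

```python
import re
from typing import Dict, List, Tuple


def natural_key(tag: str) -> List[Tuple[int, object]]:
    # Split the tag into digit and non-digit runs so that "10" sorts after "2".
    # Each part is wrapped in a (kind, value) tuple so ints and strs are never compared.
    parts = re.findall(r"\d+|\D+", tag)
    return [(0, int(p)) if p.isdigit() else (1, p.lower()) for p in parts]


def build_listing_target(tagged_items: Dict[str, str]) -> str:
    # Enumerate every tag in alphanumeric order, one line per tagged item.
    ordered_tags = sorted(tagged_items, key=natural_key)
    return "\n".join(f"{tag}. {tagged_items[tag]}" for tag in ordered_tags)


def build_training_example(tagged_items: Dict[str, str]) -> Dict[str, str]:
    # Hypothetical instruction text; the paper's exact prompt is not given in this summary.
    return {
        "prompt": "Please list the items one by one in the image.",
        "response": build_listing_target(tagged_items),
    }


if __name__ == "__main__":
    items = {"10": "a desk lamp", "2": "a brown dog", "1": "a grey cat"}
    print(build_training_example(items)["response"])
```

Sorting with a natural key rather than plain string order keeps numeric tags in the order a reader would see them on the image, which is the ordering the abstract asks the model to follow.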
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Lexicography and Language Studies
Methods: Set-of-Mark (SoM) Prompting
