ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning

Liang Zhao; En Yu; Zheng Ge; Jinrong Yang; Haoran Wei; Hongyu Zhou; Jianjian Sun; Yuang Peng; Runpei Dong; Chunrui Han; Xiangyu Zhang

arXiv:2307.09474 · cs.CL · July 19, 2023 · 1 cites

TL;DR

ChatSpot introduces precise referring instructions and a unified multimodal LLM that supports diverse interactive inputs, enhancing the accuracy and flexibility of human-AI multimodal interactions.

Contribution

The paper presents a novel multimodal LLM, ChatSpot, capable of handling various interactive inputs with precise region referencing, advancing multimodal human-AI interaction capabilities.

Findings

01. ChatSpot achieves promising performance in region recognition.

02. The model supports diverse interactive forms such as clicks, drag-and-drop, and drawing.

03. A new multi-grained vision-language dataset was constructed for training and evaluation.

Abstract

Human-AI interactivity is a critical aspect that reflects the usability of multimodal large language models (MLLMs). However, existing end-to-end MLLMs only allow users to interact with them through language instructions, which limits interactive accuracy and efficiency. In this study, we present precise referring instructions that use diverse reference representations, such as points and boxes, as referring prompts to refer to a specific region. This enables MLLMs to focus on the region of interest and achieve finer-grained interaction. Based on precise referring instructions, we propose ChatSpot, a unified end-to-end multimodal large language model that supports diverse forms of interactivity, including mouse clicks, drag-and-drop, and drawing boxes, providing a more flexible and seamless interactive experience. We also construct a multi-grained…

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Residual Connection · Absolute Position Encodings · Adam · Layer Normalization · Label Smoothing