Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models

Yupan Huang; Zaiqiao Meng; Fangyu Liu; Yixuan Su; Nigel Collier; Yutong Lu

arXiv:2308.16463 · cs.CV · September 18, 2024 · 2 cites

TL;DR

Sparkles introduces a new multimodal dialogue dataset and benchmark, enabling instruction-following models to better understand and converse across multiple images without losing single-image performance.

Contribution

The paper presents the SparklesDialogue dataset, the SparklesEval benchmark, and SparklesChat, a multimodal instruction-following model for open-ended dialogues across multiple images.

Findings

01 Enhanced multi-image dialogue comprehension in SparklesChat

02 Maintains single-image understanding capabilities

03 Resources are publicly available for further research

Abstract

Large language models exhibit enhanced zero-shot performance on various tasks when fine-tuned with instruction-following data. Multimodal instruction-following models extend these capabilities by integrating both text and images. However, existing models such as MiniGPT-4 and LLaVA face challenges in maintaining dialogue coherence in scenarios involving multiple images. A primary reason is the lack of a specialized dataset for this critical application. To bridge these gaps, we introduce SparklesDialogue, the first machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions. Furthermore, we construct SparklesEval, a GPT-assisted benchmark for quantitatively assessing a model's conversational competence across multiple images and dialogue turns. We then present SparklesChat, a multimodal instruction-following model for open-ended dialogues…
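The phrase "word-level interleaved multi-image and text interactions" can be illustrated with a minimal sketch. It assumes a hypothetical inline placeholder syntax, `<img:ID>`, to mark where an image appears inside a sentence; the placeholder format and the names `TextSpan`, `ImageRef`, `split_interleaved`, and `image_ids` are illustrative assumptions, not the format or API used by SparklesDialogue.

```python
import re
from dataclasses import dataclass
from typing import List, Union

# Hypothetical placeholder syntax (an assumption for illustration only):
# an image is referenced inline as <img:ID>, anywhere inside a sentence.
IMAGE_TAG = re.compile(r"<img:(\w+)>")


@dataclass
class TextSpan:
    text: str


@dataclass
class ImageRef:
    image_id: str


Segment = Union[TextSpan, ImageRef]


def split_interleaved(message: str) -> List[Segment]:
    """Split one dialogue turn into an ordered list of text and image segments."""
    segments: List[Segment] = []
    cursor = 0
    for match in IMAGE_TAG.finditer(message):
        # Keep any text that precedes this image placeholder.
        if match.start() > cursor:
            segments.append(TextSpan(message[cursor:match.start()]))
        segments.append(ImageRef(match.group(1)))
        cursor = match.end()
    # Keep any trailing text after the last placeholder.
    if cursor < len(message):
        segments.append(TextSpan(message[cursor:]))
    return segments


def image_ids(dialogue: List[str]) -> List[str]:
    """Collect unique image IDs in order of first mention across all turns."""
    seen: List[str] = []
    for turn in dialogue:
        for segment in split_interleaved(turn):
            if isinstance(segment, ImageRef) and segment.image_id not in seen:
                seen.append(segment.image_id)
    return seen
```

Under this assumed syntax, a turn such as `"Compare <img:a> with <img:b>."` splits into five segments alternating text and image references, so the model input preserves the exact position of each image within the sentence rather than prepending all images to the text.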

Code & Models

Repositories

hypjudy/sparkles (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Topic Modeling