Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

Keqin Chen; Zhao Zhang; Weili Zeng; Richong Zhang; Feng Zhu; Rui Zhao

arXiv:2306.15195 · cs.CV · July 4, 2023 · 71 citations

TL;DR

Shikra is a simple, natural language-based multimodal LLM that can perform referential dialogue involving spatial regions, filling a key gap in current vision-language models and enabling diverse spatial tasks.

Contribution

This paper introduces Shikra, a straightforward multimodal LLM capable of handling spatial referential dialogue using natural language inputs and outputs, without extra modules or vocabularies.

Findings

01 Shikra achieves promising performance on spatial and vision-language tasks.

02 It can handle location-based tasks such as referring expression comprehension (REC) and PointQA.

03 It enables applications such as object coordinate reasoning and region similarity comparison.

Abstract

In human conversations, individuals can indicate relevant regions within a scene while addressing others. In turn, the other person can then respond by referring to specific regions if necessary. This natural referential ability in dialogue remains absent in current Multimodal Large Language Models (MLLMs). To fill this gap, this paper proposes an MLLM called Shikra, which can handle spatial coordinate inputs and outputs in natural language. Its architecture consists of a vision encoder, an alignment layer, and an LLM. It is designed to be straightforward and simple, without the need for extra vocabularies, position encoders, pre-/post-detection modules, or external plug-in models. All inputs and outputs are in natural language form. Referential dialogue is a superset of various vision-language (VL) tasks. Shikra can naturally handle location-related tasks like referring expression comprehension (REC) and PointQA, as well as…
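The idea of passing coordinates as ordinary text, with no extra vocabulary, can be illustrated with a small round-trip sketch. This is not taken from the Shikra codebase: the bracketed `[x1,y1,x2,y2]` layout, the normalization to the range 0 to 1, the three-decimal precision, and the helper names `box_to_text` and `text_to_boxes` are all assumptions chosen for illustration.

```python
import re

def box_to_text(box, width, height, precision=3):
    """Render a pixel box (x1, y1, x2, y2) as plain text with normalized coordinates.

    The bracketed, comma-separated layout is an assumed format for this sketch.
    """
    x1, y1, x2, y2 = box
    values = (x1 / width, y1 / height, x2 / width, y2 / height)
    return "[" + ",".join(f"{v:.{precision}f}" for v in values) + "]"

# Matches four comma-separated non-negative decimals inside square brackets.
_NUM = r"(\d*\.?\d+)"
BOX_PATTERN = re.compile(
    r"\[\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*\]"
)

def text_to_boxes(text, width, height):
    """Find every bracketed box in free text and convert it back to pixel coordinates."""
    boxes = []
    for match in BOX_PATTERN.finditer(text):
        nx1, ny1, nx2, ny2 = (float(g) for g in match.groups())
        boxes.append((nx1 * width, ny1 * height, nx2 * width, ny2 * height))
    return boxes

if __name__ == "__main__":
    # Hypothetical 640x480 image with one region of interest.
    region = box_to_text((64, 48, 320, 240), width=640, height=480)
    question = f"What is the animal in {region} doing?"
    print(question)
    print(text_to_boxes(question, width=640, height=480))
```

Because the region is just a substring of the prompt or the reply, the same tokenizer handles both words and coordinates, which is the property the abstract describes as needing no extra vocabularies or position encoders.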

Code & Models

Repositories

shikras/shikra
PyTorch · Official

Models

🤗 linhuixiao/Awesome-Visual-Grounding
model · ♡ 1

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques