CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

Junyan Li; Delin Chen; Yining Hong; Zhenfang Chen; Peihao Chen; Yikang Shen; Chuang Gan

arXiv:2311.03354·cs.CV·November 7, 2023·2 cites

Open Access

TL;DR

CoVLM introduces a novel communicative decoding framework that enables large language models to explicitly compose visual entities and relationships through dynamic communication tokens, significantly improving compositional reasoning in vision-language tasks.

Contribution

The paper proposes a new communication-based approach for LLMs to explicitly model visual entities and relations, enhancing compositional reasoning capabilities.

Findings

01. Outperforms previous VLMs by ~20% in HICO-DET mAP

02. Achieves a ~14% improvement in Cola top-1 accuracy

03. Sets a new state of the art in referring expression comprehension and visual question answering

Abstract

A remarkable ability of human beings resides in compositional reasoning, i.e., the capacity to make "infinite use of finite means". However, current large vision-language foundation models (VLMs) fall short of such compositional abilities due to their "bag-of-words" behaviors and inability to construct words that correctly represent visual entities and the relations among the entities. To this end, we propose CoVLM, which can guide the LLM to explicitly compose visual entities and relationships among the text and dynamically communicate with the vision encoder and detection network to achieve vision-language communicative decoding. Specifically, we first devise a set of novel communication tokens for the LLM, for dynamic communication between the visual detection system and the language system. A communication token is generated by the LLM following a visual entity or a relation, to…
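The interleaving the abstract describes, where the LLM emits a communication token after an entity or relation and the detection side is consulted at that point, can be sketched as a toy decoding loop. This is only an illustration under stated assumptions, not the authors' implementation: the `<comm>` token name, the scripted stand-in for the LLM, the lookup-table "detector", and the choice to simply record the returned boxes are all hypothetical, and the abstract is truncated before it says what the real tokens trigger.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), an assumed box format

# Hypothetical marker: the abstract only says "communication tokens" exist.
COMM_TOKEN = "<comm>"
EOS_TOKEN = "<eos>"


def communicative_decode(
    next_token: Callable[[List[str]], str],
    detect: Callable[[str], List[Box]],
    max_steps: int = 32,
) -> Tuple[List[str], List[Tuple[str, List[Box]]]]:
    """Decode tokens one by one; on each communication token, hand the
    phrase produced since the previous one to the detector."""
    tokens: List[str] = []
    groundings: List[Tuple[str, List[Box]]] = []
    span_start = 0  # index of the first token of the current phrase
    for _ in range(max_steps):
        tok = next_token(tokens)
        if tok == EOS_TOKEN:
            break
        tokens.append(tok)
        if tok == COMM_TOKEN:
            # The phrase is everything since the last communication token.
            phrase = " ".join(tokens[span_start:-1])
            groundings.append((phrase, detect(phrase)))
            span_start = len(tokens)
    return tokens, groundings


def make_scripted_lm(script: List[str]) -> Callable[[List[str]], str]:
    """Stand-in for an LLM: replays a fixed token script, then emits EOS."""
    it = iter(script)
    return lambda context: next(it, EOS_TOKEN)


def toy_detect(phrase: str) -> List[Box]:
    """Stand-in for a detection network: a fixed phrase-to-boxes table."""
    table = {
        "a man": [(10, 20, 110, 220)],
        "riding": [(60, 150, 200, 300)],
    }
    return table.get(phrase, [])
```

Calling `communicative_decode(make_scripted_lm(["a", "man", "<comm>", "riding", "<comm>"]), toy_detect)` returns the token list together with one `(phrase, boxes)` record per communication token. In a real system the detector output would presumably condition later decoding steps; this sketch leaves that part out because the abstract does not spell it out.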

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Sparse Evolutionary Training · COLA