Language Is Not All You Need: Aligning Perception with Language Models

Shaohan Huang; Li Dong; Wenhui Wang; Yaru Hao; Saksham Singhal; Shuming Ma; Tengchao Lv; Lei Cui; Owais Khan Mohammed; Barun Patra; Qiang Liu; Kriti Aggarwal; Zewen Chi; Johan Bjorck; Vishrav Chaudhary; Subhojit Som; Xia Song; Furu Wei

arXiv:2302.14045 · cs.CL · March 2, 2023 · 164 cites

TL;DR

Kosmos-1 is a multimodal large language model trained on web-scale multimodal corpora. It handles language, perception-language, and vision tasks without fine-tuning, which the authors present as a step toward artificial general intelligence.

Contribution

This work introduces Kosmos-1, a novel multimodal large language model trained from scratch on interleaved text and images, demonstrating strong zero-shot and few-shot performance across multiple modalities.

Findings

01 Achieves high performance on language understanding and generation tasks.

02 Excels in perception-language tasks like VQA and image captioning.

03 Shows effective cross-modal transfer of knowledge.

Abstract

A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning. Experimental results show that Kosmos-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal…

Code & Models

Repositories

microsoft/unilm (PyTorch, official)

Datasets

lmms-lab/IQ50
dataset · 7 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques