InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

Pan Zhang; Xiaoyi Dong; Bin Wang; Yuhang Cao; Chao Xu; Linke Ouyang; Zhiyuan Zhao; Haodong Duan; Songyang Zhang; Shuangrui Ding; Wenwei Zhang; Hang Yan; Xinyue Zhang; Wei Li; Jingwen Li; Kai Chen; Conghui He; Xingcheng Zhang; Yu Qiao; Dahua Lin; Jiaqi Wang

arXiv:2309.15112 · cs.CV · December 15, 2023 · 31 cites

TL;DR

InternLM-XComposer is a large vision-language model that advances text-image comprehension and composition, enabling seamless integration of images into text and achieving state-of-the-art results across multiple benchmarks.

Contribution

The paper introduces InternLM-XComposer, a novel model capable of interleaved text-image composition and multilingual understanding, with a new evaluation procedure for text-image composition quality.

Findings

01. Achieves state-of-the-art results on multiple vision-language benchmarks.

02. Generates coherent articles with integrated images from a writing instruction.

03. Performs competitively with GPT-4V and GPT-3.5 in text-image composition.

Abstract

We propose InternLM-XComposer, a vision-language large model that enables advanced image-text comprehension and composition. The innovative nature of our model is highlighted by three appealing properties: 1) Interleaved Text-Image Composition: InternLM-XComposer can effortlessly generate coherent and contextual articles that seamlessly integrate images, providing a more engaging and immersive reading experience. Simply provide a writing instruction, and our system will generate the corresponding manuscript. It can intelligently identify the areas in the text where images would enhance the content and automatically insert the most appropriate visual candidates. 2) Comprehension with Rich Multilingual Knowledge: The text-image comprehension is empowered by training on an extensive multi-modal multilingual database with carefully crafted strategies, resulting in a deep understanding of…

Code & Models

Models

DLight1551/internlm-xcomposer-vl-7b-qinstruct-full (Hugging Face model; 11 downloads, 3 likes)

Taxonomy

Topics: Multimodal Machine Learning Applications