InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Haodong Duan, Songyang Zhang, Shuangrui Ding, Wenwei Zhang, Hang Yan, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang

TL;DR
InternLM-XComposer is a large vision-language model that advances text-image comprehension and composition, enabling seamless integration of images into text and achieving state-of-the-art results across multiple benchmarks.
Contribution
The paper introduces InternLM-XComposer, a novel model capable of interleaved text-image composition and multilingual understanding, with a new evaluation procedure for text-image composition quality.
Findings
Achieves state-of-the-art results on multiple vision-language benchmarks.
Effectively generates coherent articles with integrated images based on instructions.
Performs competitively with GPT-4V and GPT-3.5 in text-image composition.
Abstract
We propose InternLM-XComposer, a vision-language large model that enables advanced image-text comprehension and composition. The innovative nature of our model is highlighted by three appealing properties: 1) Interleaved Text-Image Composition: InternLM-XComposer can effortlessly generate coherent and contextual articles that seamlessly integrate images, providing a more engaging and immersive reading experience. Simply provide a writing instruction, and our system will generate the corresponding manuscript. It can intelligently identify the areas in the text where images would enhance the content and automatically insert the most appropriate visual candidates. 2) Comprehension with Rich Multilingual Knowledge: The text-image comprehension is empowered by training on an extensive multi-modal multilingual database with carefully crafted strategies, resulting in a deep understanding of…
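The interleaved composition idea in the abstract (find places in the text where an image would help, then pick the most appropriate candidate) can be illustrated with a toy sketch. This is not the model's actual method, which the text above does not specify. It assumes each candidate image is represented only by a caption and uses plain word overlap as a stand-in for a learned vision-language matching score. The names `Candidate`, `overlap`, `compose`, and the `threshold` parameter are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate image, represented here only by its caption (assumption)."""
    image_id: str
    caption: str


def tokens(text: str) -> set:
    """Lowercased word set; a toy stand-in for a learned encoder."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def compose(paragraphs, candidates, threshold=0.1):
    """Interleave paragraphs with at most one image each.

    An image is placed after a paragraph only when the best-matching
    unused candidate scores at least `threshold`; otherwise the
    paragraph is left without an image.
    """
    article = []
    used = set()
    for para in paragraphs:
        article.append(("text", para))
        remaining = [c for c in candidates if c.image_id not in used]
        if not remaining:
            continue
        best = max(remaining, key=lambda c: overlap(para, c.caption))
        if overlap(para, best.caption) >= threshold:
            article.append(("image", best.image_id))
            used.add(best.image_id)
    return article
```

Called with a paragraph about pandas, a closing paragraph, and two captioned candidates, `compose` would place the panda image after the first paragraph and leave the closing paragraph without one, since no remaining caption shares any words with it. A real system would replace `overlap` with a model-based relevance score.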
Taxonomy
Topics: Multimodal Machine Learning Applications
