InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

Pan Zhang; Xiaoyi Dong; Yuhang Zang; Yuhang Cao; Rui Qian; Lin Chen; Qipeng Guo; Haodong Duan; Bin Wang; Linke Ouyang; Songyang Zhang; Wenwei Zhang; Yining Li; Yang Gao; Peng Sun; Xinyue Zhang; Wei Li; Jingwen Li; Wenhai Wang; Hang Yan; Conghui He; Xingcheng Zhang; Kai Chen; Jifeng Dai; Yu Qiao; Dahua Lin; Jiaqi Wang

arXiv:2407.03320 · cs.CV · July 4, 2024 · 3 cites

TL;DR

InternLM-XComposer-2.5 is a versatile large vision-language model capable of long-context understanding and multi-modal tasks, achieving GPT-4V level performance with a 7B backend and supporting applications like webpage creation and article composition.

Contribution

The paper introduces a new 7B vision-language model with long-context support, enhanced comprehension capabilities, and applications in text-image composition, outperforming existing open-source models on 16 benchmarks.

Findings

01. Achieves GPT-4V level performance with only 7B parameters.
02. Supports up to 96K long contexts through RoPE extrapolation.
03. Outperforms existing open-source models on 16 benchmarks.

Abstract

We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely a 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, InternLM-XComposer-2.5 features three major upgrades in vision-language comprehension: (1) Ultra-High Resolution Understanding, (2) Fine-Grained Video Understanding, and (3) Multi-Turn Multi-Image Dialogue. In addition to comprehension, IXC-2.5 extends to two compelling applications using extra LoRA parameters for text-image composition: (1) Crafting…
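
The jump from 24K training contexts to 96K at inference rests on a general property of rotary position embeddings (RoPE): the attention score between a rotated query and key depends only on their relative offset, so the rotation is well-defined at any absolute position. The sketch below illustrates that property in plain Python. It is not the authors' implementation; the function names, the base of 10000, and the toy vectors are assumptions made for illustration, and the abstract does not say which extrapolation scheme IXC-2.5 uses.

```python
import math

def rope_angles(position, dim, base=10000.0):
    """Rotation angle for each 2-D pair of a `dim`-sized vector at `position`."""
    return [position * base ** (-2 * i / dim) for i in range(dim // 2)]

def apply_rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    out = []
    for i, theta in enumerate(rope_angles(position, len(vec), base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    """Plain dot product, standing in for an unnormalized attention score."""
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (hypothetical values, head dimension 4).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (100 tokens), once inside a 24K window and once near 96K.
score_near = dot(apply_rope(q, 1_100), apply_rope(k, 1_000))
score_far = dot(apply_rope(q, 96_000), apply_rope(k, 95_900))

print(abs(score_near - score_far) < 1e-6)
```

The two scores agree up to floating-point error, which is why positions past the training window remain usable in principle; how well a trained model exploits them is an empirical matter that the abstract covers only through its 96K claim.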

Code & Models

Repositories

internlm/internlm-xcomposer
PyTorch · Official

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques