Libra: Building Decoupled Vision System on Large Language Models

Yifan Xu; Xiaoshan Yang; Yaguang Song; Changsheng Xu

arXiv:2405.10140·cs.CV·May 17, 2024


TL;DR

Libra introduces a decoupled vision system integrated with a large language model, enabling effective cross-modal understanding with minimal training data, and offers a new approach for multimodal foundation models.

Contribution

We propose Libra, a novel decoupled vision system on LLMs that improves cross-modal comprehension with a lightweight training process.

Findings

01. Achieves strong baseline performance on image-to-text tasks.

02. Rivals existing multimodal models using only 50 million training samples.

03. Demonstrates effective decoupling of inner-modal and cross-modal processing.

Abstract

In this work, we introduce Libra, a prototype model with a decoupled vision system on a large language model (LLM). The decoupled vision system separates inner-modal modeling from cross-modal interaction, yielding unique visual information modeling and effective cross-modal comprehension. Libra is trained through discrete auto-regressive modeling on both vision and language inputs. Specifically, we incorporate a routed visual expert with a cross-modal bridge module into a pretrained LLM. This routes the vision and language flows during attention computation, enabling different attention patterns for inner-modal modeling and cross-modal interaction. Experimental results demonstrate that the dedicated design of Libra achieves a strong multimodal LLM (MLLM) baseline that rivals existing works in the image-to-text scenario with merely 50 million training samples, providing a new perspective for future…
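
The routing idea in the abstract can be illustrated with a minimal sketch. This is not the Libra implementation: the function names (`route_projection`, `attention_mask`), the per-token modality flags, and the specific mask rule are all assumptions made for illustration. It only shows the general pattern of sending vision and language tokens through separate weights and switching between an inner-modal and a cross-modal attention pattern.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(weight: Matrix, x: Sequence[float]) -> Vector:
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def route_projection(
    tokens: Sequence[Vector],
    is_vision: Sequence[bool],
    w_language: Matrix,
    w_vision: Matrix,
) -> List[Vector]:
    """Project each token with the weights of its own modality.

    Hypothetical stand-in for a routed expert: vision tokens use
    w_vision, language tokens use w_language.
    """
    if len(tokens) != len(is_vision):
        raise ValueError("tokens and is_vision must have the same length")
    return [
        matvec(w_vision if flag else w_language, tok)
        for tok, flag in zip(tokens, is_vision)
    ]


def attention_mask(is_vision: Sequence[bool], cross_modal: bool) -> List[List[bool]]:
    """Build a causal mask where mask[q][k] is True if query q may attend to key k.

    Assumed rule (illustrative only): in the inner-modal case a query sees
    only earlier tokens of its own modality; in the cross-modal case it sees
    all earlier tokens.
    """
    n = len(is_vision)
    return [
        [k <= q and (cross_modal or is_vision[k] == is_vision[q]) for k in range(n)]
        for q in range(n)
    ]


# Two vision tokens followed by one language token (toy values).
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
is_vision = [True, True, False]
w_language = [[1.0, 0.0], [0.0, 1.0]]  # identity, stands in for the LLM weights
w_vision = [[2.0, 0.0], [0.0, 2.0]]    # stands in for the visual expert weights

routed = route_projection(tokens, is_vision, w_language, w_vision)
inner_mask = attention_mask(is_vision, cross_modal=False)
cross_mask = attention_mask(is_vision, cross_modal=True)
```

In this toy setup the vision tokens are scaled by the stand-in visual expert while the language token passes through unchanged, and the two masks differ only in whether the language token may attend to the preceding vision tokens.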

Code & Models

Repositories

yifanxu74/libra (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotics and Automated Systems · Handwritten Text Recognition Techniques