Libra: Building Decoupled Vision System on Large Language Models
Yifan Xu, Xiaoshan Yang, Yaguang Song, Changsheng Xu

TL;DR
Libra introduces a decoupled vision system integrated with a large language model, enabling effective cross-modal understanding with minimal training data, and offers a new approach for multimodal foundation models.
Contribution
We propose Libra, a novel decoupled vision system on LLMs that improves cross-modal comprehension with a lightweight training process.
Findings
Achieves strong baseline performance in image-to-text tasks.
Rivals existing multimodal models in the image-to-text scenario while training on only 50 million data samples.
Demonstrates effective decoupling of inner-modal and cross-modal processing.
Abstract
In this work, we introduce Libra, a prototype model with a decoupled vision system on a large language model (LLM). The vision system separates inner-modal modeling from cross-modal interaction, yielding unique visual information modeling and effective cross-modal comprehension. Libra is trained through discrete auto-regressive modeling on both vision and language inputs. Specifically, we incorporate a routed visual expert with a cross-modal bridge module into a pretrained LLM. The expert routes the vision and language flows during attention computation, which enables different attention patterns for inner-modal modeling and for cross-modal interaction. Experimental results demonstrate that the dedicated design of Libra achieves a strong MLLM baseline that rivals existing works in the image-to-text scenario with merely 50 million training data, providing a new perspective for future…
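The abstract describes routing vision and language flows to different parameters inside a shared model. As a rough illustration of that general idea only, the sketch below sends each token through one of two projection matrices depending on a modality flag. It is not Libra's actual implementation: the function names (`matvec`, `route_by_modality`), the per-token boolean flag, and the toy weights are all assumptions made for this example, and the cross-modal bridge module is not modeled.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(weights: Matrix, x: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def route_by_modality(
    tokens: List[Vector],
    is_vision: List[bool],
    w_vision: Matrix,
    w_language: Matrix,
) -> List[Vector]:
    """Project each token with the weights that match its modality.

    Hypothetical helper: vision tokens use w_vision (standing in for a
    visual expert), all other tokens use w_language (standing in for the
    pretrained LLM weights).
    """
    if len(tokens) != len(is_vision):
        raise ValueError("tokens and is_vision must have the same length")
    return [
        matvec(w_vision if vision else w_language, token)
        for token, vision in zip(tokens, is_vision)
    ]


# Toy example: one vision token followed by one language token.
tokens = [[1.0, 2.0], [3.0, 4.0]]
is_vision = [True, False]
w_vision = [[2.0, 0.0], [0.0, 2.0]]    # assumed toy weights: scale by 2
w_language = [[1.0, 0.0], [0.0, 1.0]]  # assumed toy weights: identity

routed = route_by_modality(tokens, is_vision, w_vision, w_language)
print(routed)
```

In this toy setup the vision token is scaled while the language token passes through unchanged, which shows how two modalities can share one sequence yet be processed by separate parameters.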
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Automated Systems · Handwritten Text Recognition Techniques
