X-VILA: Cross-Modality Alignment for Large Language Model
Hanrong Ye, De-An Huang, Yao Lu, Zhiding Yu, Wei Ping, Andrew Tao, Jan Kautz, Song Han, Dan Xu, Pavlo Molchanov, Hongxu Yin

TL;DR
X-VILA is a versatile multi-modal model that extends large language models to understand and generate across image, video, and audio modalities, using novel alignment techniques and a curated dataset.
Contribution
The paper introduces X-VILA, a cross-modality alignment framework for LLMs, with a new visual embedding highway and an efficient training recipe, enabling proficient multi-modal interactions.
Findings
Surpasses previous approaches by large margins in any-to-any modality conversation.
Demonstrates emergent cross-modal capabilities without extensive training data.
Addresses visual information loss with a novel visual alignment mechanism.
Abstract
We introduce X-VILA, an omni-modality model designed to extend the capabilities of large language models (LLMs) by incorporating image, video, and audio modalities. By aligning modality-specific encoders with LLM inputs and diffusion decoders with LLM outputs, X-VILA achieves cross-modality understanding, reasoning, and generation. To facilitate this cross-modality alignment, we curate an effective interleaved any-to-any modality instruction-following dataset. Furthermore, we identify a significant problem with the current cross-modality alignment method, which results in visual information loss. To address the issue, we propose a visual alignment mechanism with a visual embedding highway module. We then introduce a resource-efficient recipe for training X-VILA, which exhibits proficiency in any-to-any modality conversation, surpassing previous approaches by large margins. X-VILA also…
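The abstract's point about visual information loss can be illustrated with a toy sketch. This is not the paper's architecture: the functions `llm_path` and `decode`, the `gate` parameter, and the averaging bottleneck are all hypothetical stand-ins, chosen only to show why routing encoder features directly to the decoder, alongside the LLM path, can preserve detail that a text-aligned path discards.

```python
from typing import List, Optional

Vector = List[float]


def llm_path(visual: Vector) -> Vector:
    """Hypothetical lossy path: collapse the features to their mean.

    Stands in for a text-aligned embedding that keeps a coarse summary
    of the image but drops fine visual detail.
    """
    mean = sum(visual) / len(visual)
    return [mean] * len(visual)


def decode(conditioning: Vector,
           highway: Optional[Vector] = None,
           gate: float = 0.5) -> Vector:
    """Hypothetical decoder conditioning.

    Without a highway, the decoder sees only the LLM-path signal.
    With a highway, encoder features are blended in as a gated residual.
    """
    if highway is None:
        return list(conditioning)
    return [c + gate * (h - c) for c, h in zip(conditioning, highway)]


def squared_error(a: Vector, b: Vector) -> float:
    """Sum of squared differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up encoder features for a single image (illustrative values only).
visual = [0.2, 0.9, 0.4, 0.5]
coarse = llm_path(visual)

# Reconstruction error with and without the bypass path.
err_without = squared_error(decode(coarse), visual)
err_with = squared_error(decode(coarse, highway=visual), visual)

print("error without highway:", err_without)
print("error with highway:", err_with)
```

In this toy setup the bypass reduces the reconstruction error because the decoder receives detail that the bottleneck removed; how X-VILA actually injects visual features into its diffusion decoders is described in the paper itself.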
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Diffusion
