X-VILA: Cross-Modality Alignment for Large Language Model

Hanrong Ye; De-An Huang; Yao Lu; Zhiding Yu; Wei Ping; Andrew Tao; Jan Kautz; Song Han; Dan Xu; Pavlo Molchanov; Hongxu Yin

arXiv:2405.19335·cs.CV·May 30, 2024·1 cites

Open Access

TL;DR

X-VILA is a versatile multi-modal model that extends large language models to understand and generate across image, video, and audio modalities, using novel alignment techniques and a curated dataset.

Contribution

The paper introduces X-VILA, a cross-modality alignment framework for LLMs, with a new visual embedding highway and an efficient training recipe, enabling proficient multi-modal interactions.

Findings

01 Surpasses previous approaches by large margins in any-to-any modality conversation.

02 Demonstrates emergent cross-modal capabilities without extensive training data.

03 Addresses visual information loss with a novel visual alignment mechanism.

Abstract

We introduce X-VILA, an omni-modality model designed to extend the capabilities of large language models (LLMs) by incorporating image, video, and audio modalities. By aligning modality-specific encoders with LLM inputs and diffusion decoders with LLM outputs, X-VILA achieves cross-modality understanding, reasoning, and generation. To facilitate this cross-modality alignment, we curate an effective interleaved any-to-any modality instruction-following dataset. Furthermore, we identify a significant problem with the current cross-modality alignment method, which results in visual information loss. To address the issue, we propose a visual alignment mechanism with a visual embedding highway module. We then introduce a resource-efficient recipe for training X-VILA, which exhibits proficiency in any-to-any modality conversation, surpassing previous approaches by large margins. X-VILA also…
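The alignment idea in the abstract can be pictured with a toy sketch. This is not the authors' implementation: the linear projections, the feature dimensions, the additive form of the "highway" bypass, and every name below (`align_input`, `align_output`, `highway_condition`) are assumptions made purely for illustration. It only shows the data flow the abstract describes: encoder features mapped into the LLM's embedding space, LLM outputs mapped into a decoder's conditioning space, and visual features reaching the decoder without passing through the LLM.

```python
import random

LLM_DIM = 8  # hypothetical LLM embedding width
ENCODER_DIMS = {"image": 6, "video": 6, "audio": 4}  # hypothetical encoder widths
DECODER_DIMS = {"image": 5, "video": 5, "audio": 3}  # hypothetical decoder condition widths


def make_projection(in_dim, out_dim, seed):
    """Return a random out_dim x in_dim matrix standing in for a learned layer."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# One input projection per encoder and one output projection per decoder.
input_projections = {
    m: make_projection(d, LLM_DIM, seed=i) for i, (m, d) in enumerate(ENCODER_DIMS.items())
}
output_projections = {
    m: make_projection(LLM_DIM, d, seed=100 + i) for i, (m, d) in enumerate(DECODER_DIMS.items())
}


def align_input(modality, encoder_feature):
    """Map a modality-specific encoder feature into the LLM embedding space."""
    return project(input_projections[modality], encoder_feature)


def align_output(modality, llm_hidden):
    """Map an LLM output embedding into a decoder's conditioning space."""
    return project(output_projections[modality], llm_hidden)


def highway_condition(text_condition, visual_feature, highway):
    """Assumed bypass: add projected visual encoder features to the decoder
    condition directly, so visual detail does not have to survive the LLM."""
    bypass = project(highway, visual_feature)
    return [t + b for t, b in zip(text_condition, bypass)]


image_feature = [1.0] * ENCODER_DIMS["image"]
llm_input = align_input("image", image_feature)
# A real model would run the LLM here; the toy reuses the input embedding.
condition = align_output("image", llm_input)
highway = make_projection(ENCODER_DIMS["image"], DECODER_DIMS["image"], seed=7)
final_condition = highway_condition(condition, image_feature, highway)
```

In this toy, the bypass is the only path by which the raw image feature reaches the decoder condition without first being squeezed through the LLM-width embedding, which is the kind of information loss the abstract says the visual embedding highway is meant to address.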

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Diffusion