HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
Chenxin Tao, Shiqian Su, Xizhou Zhu, Chenyu Zhang, Zhe Chen, Jiawen, Liu, Wenhai Wang, Lewei Lu, Gao Huang, Yu Qiao, Jifeng Dai

TL;DR
HoVLE introduces a holistic embedding module for monolithic vision-language models, enabling effective processing of visual and textual data in a shared space, and achieves performance close to leading models.
Contribution
This paper proposes a novel holistic embedding module and multi-stage training strategy for monolithic VLMs, significantly improving their performance without degrading language capabilities.
Findings
Achieves near state-of-the-art results on multiple benchmarks.
Outperforms previous monolithic VLMs by a large margin.
Effectively aligns visual and textual embeddings in a shared space.
Abstract
The rapid advance of Large Language Models (LLMs) has catalyzed the development of Vision-Language Models (VLMs). Monolithic VLMs, which avoid modality-specific encoders, offer a promising alternative to the compositional ones but face the challenge of inferior performance. Most existing monolithic VLMs require tuning pre-trained LLMs to acquire vision abilities, which may degrade their language capabilities. To address this dilemma, this paper presents a novel high-performance monolithic VLM named HoVLE. We note that LLMs have been shown capable of interpreting images, when image embeddings are aligned with text embeddings. The challenge for current monolithic VLMs actually lies in the lack of a holistic embedding module for both vision and language inputs. Therefore, HoVLE introduces a holistic embedding module that converts visual and textual inputs into a shared space, allowing LLMs…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
MethodsALIGN
