HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

Chenxin Tao; Shiqian Su; Xizhou Zhu; Chenyu Zhang; Zhe Chen; Jiawen Liu; Wenhai Wang; Lewei Lu; Gao Huang; Yu Qiao; Jifeng Dai

arXiv:2412.16158 · cs.CV · February 11, 2025

TL;DR

HoVLE introduces a holistic embedding module for monolithic vision-language models that maps visual and textual inputs into a shared space, and achieves performance close to that of leading compositional VLMs.

Contribution

This paper proposes a holistic embedding module and a multi-stage training strategy for monolithic VLMs, significantly improving their performance without degrading the language capabilities of the underlying LLM.

Findings

01. Achieves near state-of-the-art results on multiple benchmarks.

02. Outperforms previous monolithic VLMs by a large margin.

03. Effectively aligns visual and textual embeddings in a shared space.

Abstract

The rapid advance of Large Language Models (LLMs) has catalyzed the development of Vision-Language Models (VLMs). Monolithic VLMs, which avoid modality-specific encoders, offer a promising alternative to compositional ones but face the challenge of inferior performance. Most existing monolithic VLMs require tuning pre-trained LLMs to acquire vision abilities, which may degrade their language capabilities. To address this dilemma, this paper presents a novel high-performance monolithic VLM named HoVLE. We note that LLMs have been shown to be capable of interpreting images when image embeddings are aligned with text embeddings. The challenge for current monolithic VLMs actually lies in the lack of a holistic embedding module for both vision and language inputs. Therefore, HoVLE introduces a holistic embedding module that converts visual and textual inputs into a shared space, allowing LLMs…
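To make the idea of a shared embedding space concrete, here is a minimal NumPy sketch of one module that turns both image patches and text tokens into a single sequence of same-width vectors. It is a toy illustration under stated assumptions, not the authors' architecture: the sizes (`D_MODEL`, `PATCH`, `VOCAB`), the weight names, and the single residual mixing layer are all hypothetical, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical and chosen only to keep the example small.
D_MODEL = 16   # width of the shared embedding space
PATCH = 4      # side length of a square image patch, in pixels
VOCAB = 100    # toy vocabulary size

# Modality-specific input layers: a linear patch projection and a token table.
W_patch = rng.normal(0.0, 0.02, size=(PATCH * PATCH * 3, D_MODEL))
E_text = rng.normal(0.0, 0.02, size=(VOCAB, D_MODEL))
# One shared layer applied to every position, whatever its modality (assumption).
W_shared = rng.normal(0.0, 0.02, size=(D_MODEL, D_MODEL))


def patchify(image, patch=PATCH):
    """Split an (H, W, 3) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)             # (rows, cols, patch, patch, c)
    return x.reshape(-1, patch * patch * c)    # (num_patches, patch*patch*c)


def holistic_embed(image, token_ids):
    """Embed an image and a token sequence into one shared-space sequence."""
    img_tokens = patchify(image) @ W_patch     # (num_patches, D_MODEL)
    txt_tokens = E_text[token_ids]             # (num_tokens, D_MODEL)
    seq = np.concatenate([img_tokens, txt_tokens], axis=0)
    # Residual mixing with weights shared across both modalities.
    return seq + np.tanh(seq @ W_shared)


image = rng.random((8, 8, 3))                  # 8x8 RGB image, so 4 patches
embeddings = holistic_embed(image, [5, 17, 42])
print(embeddings.shape)
```

The point of the sketch is the output contract: image patches and text tokens leave the module as rows of one matrix with a common width, so a downstream LLM could consume them like ordinary token embeddings.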

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques

Methods: ALIGN