EVLM: An Efficient Vision-Language Model for Visual Understanding

Kaibing Chen; Dong Shen; Hanwen Zhong; Huasong Zhong; Kui Xia; Di Xu; Wei Yuan; Yifei Hu; Bin Wen; Tianke Zhang; Changyi Liu; Dewen Fan; Huihui Xiao; Jiahong Wu; Fan Yang; Size Li; Di Zhang

arXiv:2407.14177 · cs.CV · July 22, 2024

TL;DR

EVLM introduces an efficient multi-modal vision-language model that reduces computational costs and enhances visual perception by using hierarchical features, cross-attention, and MoE, achieving competitive results on benchmarks.

Contribution

The paper presents a novel multi-modal model combining hierarchical ViT features, cross-attention, and MoE to improve efficiency and visual understanding in vision-language tasks.

Findings

01 Achieves competitive scores on multi-modal benchmarks.

02 Performs well in image and video captioning tasks.

03 Reduces computational overhead compared to existing models.

Abstract

In the field of multi-modal language models, the majority of methods are built on an architecture similar to LLaVA. These models use a single-layer ViT feature as a visual prompt, feeding it directly into the language model alongside textual tokens. However, when dealing with long sequences of visual signals or inputs such as videos, the self-attention mechanism of the language model can lead to significant computational overhead. Additionally, relying on single-layer ViT features makes it difficult for large language models to perceive visual signals fully. This paper proposes an efficient multi-modal language model that minimizes computational cost while enabling the model to perceive visual signals as comprehensively as possible. Our method primarily includes: (1) employ cross-attention for image-text interaction, similar to Flamingo; (2) utilize hierarchical ViT features; (3) introduce…
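The overhead argument in the abstract can be illustrated with a rough count of query-key score pairs per attention layer. This is a back-of-the-envelope sketch, not the paper's own cost analysis: the function names and the token counts (256 text tokens, 16 frames of 576 patch tokens each) are hypothetical, and the count ignores causal masking, attention heads, and projection costs.

```python
def self_attention_pairs(num_text: int, num_visual: int) -> int:
    """Score pairs when visual tokens are concatenated with text tokens
    and the language model self-attends over the full sequence."""
    total = num_text + num_visual
    return total * total


def cross_attention_pairs(num_text: int, num_visual: int) -> int:
    """Score pairs when text self-attends over text only, and text queries
    cross-attend to visual keys (a Flamingo-style layout)."""
    return num_text * num_text + num_text * num_visual


# Hypothetical sizes chosen only for illustration.
text_tokens = 256
visual_tokens = 16 * 576  # 16 video frames x 576 patch tokens per frame

concat = self_attention_pairs(text_tokens, visual_tokens)
cross = cross_attention_pairs(text_tokens, visual_tokens)

print(f"concatenated self-attention pairs: {concat}")
print(f"text self-attention + cross-attention pairs: {cross}")
print(f"ratio: {concat / cross:.1f}")
```

Under these simplifications the ratio reduces to (text + visual) / text, so the gap grows linearly with the number of visual tokens, which is why long videos are the hard case for the concatenation design.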

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques