Unveiling Encoder-Free Vision-Language Models

Haiwen Diao; Yufeng Cui; Xiaotong Li; Yueze Wang; Huchuan Lu; Xinlong Wang

arXiv:2406.11832 · cs.CV · October 30, 2024 · 3 cites

TL;DR

This paper introduces EVE, an encoder-free vision-language model that uses a unified decoder and extra supervision to efficiently learn visual and language representations, rivaling traditional encoder-based models.

Contribution

The paper presents a simple training recipe for pure encoder-free VLMs and shows that such models can be competitive with encoder-based ones while being less complex and more transparent.

Findings

01. EVE achieves results comparable to those of encoder-based models on multiple benchmarks.

02. Encoder-free training with the proposed strategies is efficient and effective.

03. EVE significantly outperforms similarly sized models such as Fuyu-8B.

Abstract

Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks. However, the vision encoders set a strong inductive bias in abstracting visual representation, e.g., resolution, aspect ratio, and semantic priors, which could impede the flexibility and efficiency of the VLMs. Training pure VLMs that accept the seamless vision and language inputs, i.e., without vision encoders, remains challenging and rarely explored. Empirical observations reveal that direct training without encoders results in slow convergence and large performance gaps. In this work, we bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs. Specifically, we unveil the key aspects of training encoder-free VLMs efficiently via thorough…
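
To make the term "encoder-free" concrete, the following is a minimal sketch of the general idea only, not the authors' implementation: raw image patches are flattened, mapped by a single linear projection into the language model's embedding space, and concatenated with text token embeddings so that one decoder sees a single sequence. Every name here (`patchify`, `project`, `build_decoder_input`) and every size is a hypothetical choice for illustration, and the random weights stand in for learned parameters.

```python
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    flat.extend(image[row][col])  # append the C channel values
            patches.append(flat)
    return patches


def project(vector, weight):
    """Apply a linear map; weight has one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def build_decoder_input(image, text_ids, patch, patch_proj, text_embed):
    """Return visual tokens followed by text tokens as one sequence of vectors."""
    visual_tokens = [project(p, patch_proj) for p in patchify(image, patch)]
    text_tokens = [text_embed[i] for i in text_ids]
    return visual_tokens + text_tokens


# Toy sizes, chosen only for illustration.
rng = random.Random(0)
HIDDEN, PATCH, CHANNELS, VOCAB = 8, 2, 3, 16
image = [[[rng.random() for _ in range(CHANNELS)] for _ in range(6)] for _ in range(4)]
patch_proj = [[rng.gauss(0, 0.02) for _ in range(PATCH * PATCH * CHANNELS)]
              for _ in range(HIDDEN)]
text_embed = [[rng.gauss(0, 0.02) for _ in range(HIDDEN)] for _ in range(VOCAB)]

sequence = build_decoder_input(image, [1, 5, 2], PATCH, patch_proj, text_embed)
print(len(sequence), len(sequence[0]))  # prints: 9 8
```

The 4 x 6 toy image yields six 2 x 2 patches, which together with three text tokens give a sequence of nine vectors. Because no fixed-resolution vision encoder sits in front of the decoder, the only constraint this sketch places on the image is that its sides are multiples of the patch size.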

Code & Models

Repositories

baaivision/eve (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Semantic Web and Ontologies

Methods: Sparse Evolutionary Training