Unveiling Encoder-Free Vision-Language Models
Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, Xinlong Wang

TL;DR
This paper introduces EVE, an encoder-free vision-language model that uses a unified decoder and extra supervision to efficiently learn visual and language representations, rivaling traditional encoder-based models.
Contribution
It presents a simple training recipe for pure encoder-free VLMs, demonstrating competitive performance with less complexity and more transparency.
Findings
EVE achieves comparable results to encoder-based models on multiple benchmarks.
Encoder-free training with the proposed strategies is efficient and effective.
EVE outperforms similar-sized models like Fuyu-8B significantly.
Abstract
Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks. However, the vision encoders set a strong inductive bias in abstracting visual representation, e.g., resolution, aspect ratio, and semantic priors, which could impede the flexibility and efficiency of the VLMs. Training pure VLMs that accept the seamless vision and language inputs, i.e., without vision encoders, remains challenging and rarely explored. Empirical observations reveal that direct training without encoders results in slow convergence and large performance gaps. In this work, we bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs. Specifically, we unveil the key aspects of training encoder-free VLMs efficiently via thorough…
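To make the phrase "without vision encoders" concrete, here is a minimal sketch of how an image could be fed straight into a decoder: raw pixel patches are flattened, mapped to the model width by a single linear projection, and concatenated with ordinary text token embeddings. This is an illustration under stated assumptions, not the authors' implementation. The helper names (`patchify`, `build_decoder_input`), the patch size, the dimensions, and the random weights are all hypothetical, and the abstract does not specify how EVE embeds patches or what extra supervision it applies.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be multiples of patch"
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    blocks = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return blocks.reshape(rows * cols, patch * patch * c)

def build_decoder_input(image, text_ids, patch_proj, token_table, patch=4):
    """Return one (num_patches + num_text_tokens, d_model) sequence.

    Hypothetical helper: no pretrained vision encoder is involved, only a
    linear projection of raw patches and a text embedding lookup.
    """
    visual = patchify(image, patch) @ patch_proj   # linear projection of pixels
    textual = token_table[text_ids]                # standard embedding lookup
    return np.concatenate([visual, textual], axis=0)

# Assumed toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
d_model, vocab, patch = 16, 100, 4
image = rng.random((8, 12, 3))                     # non-square input is fine
patch_proj = rng.normal(size=(patch * patch * 3, d_model))
token_table = rng.normal(size=(vocab, d_model))
text_ids = np.array([5, 17, 42])

sequence = build_decoder_input(image, text_ids, patch_proj, token_table, patch)
print(sequence.shape)
```

Because the number of visual tokens simply follows the image size, a layout like this places no fixed resolution or aspect ratio on the input, which is the kind of flexibility the abstract attributes to dropping the encoder.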
Taxonomy
Topics: Multimodal Machine Learning Applications · Semantic Web and Ontologies
Methods: Sparse Evolutionary Training
