Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

Boqiang Zhang; Lei Ke; Ruihan Yang; Qi Gao; Tianyuan Qu; Rossell Chen; Dong Yu; Leoweiliang

arXiv:2603.06569·cs.CV·March 17, 2026

Open Access · 5 Models · 2 Datasets

TL;DR

Penguin-VL introduces a vision encoder initialized from a text-only LLM, significantly improving visual fidelity and reasoning in compact VLMs without relying on large-scale contrastive pretraining.

Contribution

The paper presents Penguin-VL, which initializes the vision encoder from a text-only LLM instead of a contrastively pretrained model (e.g., CLIP/SigLIP), with the aim of improving both performance and efficiency in compact VLMs.

Findings

01 Penguin-VL matches or exceeds state-of-the-art VLMs on a range of benchmarks.

02 Its vision encoder preserves fine-grained visual cues better than contrastively pretrained encoders.

03 It achieves high performance with a lightweight architecture, making it suitable for resource-constrained devices.

Abstract

Vision Language Model (VLM) development has largely relied on scaling model size, which hinders deployment on compute-constrained mobile and edge devices such as smartphones and robots. In this work, we explore the performance limits of compact (e.g., 2B and 8B) VLMs. We challenge the prevailing practice that state-of-the-art VLMs must rely on vision encoders initialized via massive contrastive pretraining (e.g., CLIP/SigLIP). We identify an objective mismatch: contrastive learning, optimized for discrimination, enforces coarse and category-level invariances that suppress fine-grained visual cues needed for dense captioning and complex VLM reasoning. To address this issue, we present Penguin-VL, whose vision encoder is initialized from a text-only LLM. Our experiments reveal that Penguin-Encoder serves as a superior alternative to traditional contrastive pretraining, unlocking a higher…
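To make the objective the abstract argues against more concrete, here is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss in plain Python. It is a generic illustration, not code from the paper: the function names, the toy embeddings, and the temperature of 0.07 are all assumptions chosen for the example. The point it shows is that the loss sees only one pooled vector per image and per caption, and rewards nothing beyond telling matched pairs apart from mismatched ones.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; pair i of images and texts is the positive.

    The temperature value is an illustrative assumption, not a setting
    taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    # Image-to-text direction: each row should put its mass on the diagonal.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: same check over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (i2t + t2i)


# Toy batch of two image/caption pairs (hypothetical 2-D embeddings).
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned = contrastive_loss(images, matched_texts)
shuffled = contrastive_loss(images, swapped_texts)
```

In this toy batch the loss is close to zero when captions line up with their images and large when they are swapped. Any detail that does not help separate one pair from the others has no effect on the loss, which is one way to read the "objective mismatch" described in the abstract.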

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications