Efficient Architectures for High Resolution Vision-Language Models

Miguel Carvalho; Bruno Martins

arXiv:2501.02584·cs.CV·November 21, 2025

TL;DR

This paper introduces Pheye, an efficient high-resolution vision-language model architecture that reduces parameters and enhances fine-grained image understanding, especially in scene-text recognition tasks.

Contribution

The paper presents Pheye, a novel architecture that efficiently processes high-resolution images with fewer parameters, improving fine detail recognition in vision-language tasks.

Findings

01. Pheye achieves high efficiency with fewer parameters.
02. Pheye performs well in fine-grained image understanding tasks.
03. Pheye excels in scene-text recognition tasks.

Abstract

Vision-Language Models (VLMs) have recently advanced significantly. However, accurately recognizing fine details within high-resolution images remains a challenge, which limits performance in multiple tasks. This work introduces Pheye, a novel architecture that efficiently processes high-resolution images while training fewer parameters than similarly sized VLMs. Notably, Pheye achieves high efficiency while maintaining strong performance, particularly in tasks that demand fine-grained image understanding and/or the handling of scene-text.

Code & Models

Repositories

miguelscarv/pheye
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques