Text-Guided Semantic Image Encoder

Raghuveer Thirukovalluru; Xiaochuang Han; Bhuwan Dhingra; Emily Dinan; Maha Elbayad

arXiv:2511.20770 · cs.CV · November 27, 2025

Open Access

TL;DR

The paper introduces TIE, a text-guided image encoder that produces task-specific image representations conditioned on input text, improving performance and efficiency in vision-language models across multiple benchmarks.

Contribution

TIE is a novel text-conditioned image encoder that enhances VLM performance and interpretability while reducing computational requirements.

Findings

01

VLMs equipped with TIE outperform their conventional counterparts by +1.5 points (1B scale) and +1.3 points (3B scale) on average across nine image-to-text benchmarks.

02

Gains reach up to 6 points on tasks such as DocVQA and InfoVQA.

03

Attains superior performance while using only half as many image tiles (tokens), which improves inference efficiency.

Abstract

Image encoders, a fundamental component of vision-language models (VLMs), are typically pretrained independently before being aligned with a language model. This standard paradigm results in encoders that process images agnostically, without regard to the specific downstream task or text query. To address this limitation, we propose the Text-Guided Semantic Image Encoder (TIE), which generates image representations conditioned on the input text query. VLMs equipped with TIE outperform their conventional counterparts by +1.5 and +1.3 points on average across nine image-to-text benchmarks at the 1B and 3B scales, respectively, with gains reaching up to 6 points on tasks such as DocVQA and InfoVQA. Moreover, TIE-based VLMs attain superior performance while utilizing only half as many image tiles (tokens), resulting in notably improved inference efficiency. TIE also generalizes well with…
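The contrast the abstract draws, between an encoder that processes an image without regard to the query and one conditioned on it, can be illustrated with a toy sketch. This is not TIE's architecture, which this page does not describe. The function names, the 2-D patch features, and the softmax-similarity weighting are all assumptions chosen only to show why a query-conditioned representation can differ per question while an agnostic one cannot.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def agnostic_pool(patches):
    """Text-agnostic baseline: mean of patch features, identical for every query."""
    n, dim = len(patches), len(patches[0])
    return [sum(p[i] for p in patches) / n for i in range(dim)]


def text_conditioned_pool(patches, query):
    """Hypothetical conditioning (an assumption, not the paper's method):
    weight each patch by its softmax similarity to the query embedding."""
    weights = softmax([dot(p, query) for p in patches])
    dim = len(patches[0])
    pooled = [sum(w * p[i] for w, p in zip(weights, patches)) for i in range(dim)]
    return pooled, weights


# Made-up features: patch 0 stands for a text region, patch 1 for a chart region.
patches = [[1.0, 0.0], [0.0, 1.0]]
query_about_text = [4.0, 0.0]   # hypothetical embedding of a text-focused question
query_about_chart = [0.0, 4.0]  # hypothetical embedding of a chart-focused question

pooled_text, w_text = text_conditioned_pool(patches, query_about_text)
pooled_chart, w_chart = text_conditioned_pool(patches, query_about_chart)
baseline = agnostic_pool(patches)  # same vector regardless of the question
```

In this toy setting the conditioned output shifts toward whichever patch matches the query, while the baseline stays fixed. That is the intuition behind spending fewer image tokens on a representation already focused on what the question asks about.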

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Neural Network Applications