Images are Worth Variable Length of Representations

Lingjun Mao; Rodolfo Corona; Xin Liang; Wenhao Yan; Zineng Tang

arXiv:2506.03643 · cs.CV · June 6, 2025

Open Access

TL;DR

DOVE introduces a dynamic vision encoder that generates variable-length representations tailored to image complexity, improving efficiency and semantic richness over fixed-length methods.

Contribution

The paper proposes DOVE, a variable-length tokenization approach for vision encoders, together with a query-conditioned extension for targeted semantic extraction.

Findings

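01 Reduces the average number of tokens per image while maintaining high reconstruction quality.

02 Outperforms existing autoencoder-based tokenization methods on several linear probing and downstream multimodal tasks while using far fewer tokens.

03 Captures more expressive semantic features than fixed-length encoding.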

Abstract

Most existing vision encoders map images into a fixed-length sequence of tokens, overlooking the fact that different images contain varying amounts of information. For example, a visually complex image (e.g., a cluttered room) inherently carries more information and thus deserves more tokens than a simple image (e.g., a blank wall). To address this inefficiency, we propose DOVE, a dynamic vision encoder that produces a variable number of visual tokens (i.e., continuous representation vectors) to reconstruct each image. Our results show that DOVE significantly reduces the average number of tokens while maintaining high reconstruction quality. In several linear probing and downstream multimodal tasks, it outperforms existing autoencoder-based tokenization methods when using far fewer tokens, capturing more expressive semantic features compared to fixed-length encoding. We further extend…
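
To make the idea of a variable token budget concrete, here is a minimal sketch. It is not DOVE's method, which the abstract does not detail. It assumes a hypothetical setup in which the reconstruction error of each image is already known for every prefix length of its token sequence, and it simply picks the shortest prefix whose error falls within a tolerance. The function names, the error values, and the tolerance are all illustrative assumptions.

```python
from typing import Sequence


def choose_token_count(errors_by_length: Sequence[float], tolerance: float) -> int:
    """Return the smallest token count whose reconstruction error is within tolerance.

    errors_by_length[k - 1] is the (hypothetical) reconstruction error when only
    the first k tokens are kept. If no prefix meets the tolerance, the full
    length is used.
    """
    for k, err in enumerate(errors_by_length, start=1):
        if err <= tolerance:
            return k
    return len(errors_by_length)


def average_tokens(counts: Sequence[int]) -> float:
    """Mean number of tokens used per image."""
    return sum(counts) / len(counts)


# Made-up error curves, for illustration only: a simple image converges
# after few tokens, a complex one needs the full budget.
blank_wall = [0.30, 0.04, 0.03, 0.03]
cluttered_room = [0.90, 0.60, 0.35, 0.08]

counts = [choose_token_count(e, tolerance=0.10) for e in (blank_wall, cluttered_room)]
fixed_length = len(blank_wall)  # a fixed-length encoder would always use all 4 tokens

print(counts, average_tokens(counts), fixed_length)
```

With these made-up numbers, the simple image stops at 2 tokens and the complex one uses all 4, so the average budget drops below the fixed length of 4. The point of the sketch is only the accounting: per-image lengths that track image complexity lower the average token count without raising the error ceiling.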

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Adversarial Robustness in Machine Learning

Methods: Focus