Images are Worth Variable Length of Representations
Lingjun Mao, Rodolfo Corona, Xin Liang, Wenhao Yan, Zineng Tang

TL;DR
DOVE introduces a dynamic vision encoder that generates variable-length representations tailored to image complexity, improving efficiency and semantic richness over fixed-length methods.
Contribution
The paper proposes DOVE, a variable-length tokenization approach for vision encoders, together with a query-conditioned extension for targeted semantic extraction.
Findings
Reduces average token count while maintaining high reconstruction quality.
Outperforms existing autoencoder-based tokenization methods on linear probing and downstream multimodal tasks while using far fewer tokens.
Captures more expressive semantic features than fixed-length encoding.
Abstract
Most existing vision encoders map images into a fixed-length sequence of tokens, overlooking the fact that different images contain varying amounts of information. For example, a visually complex image (e.g., a cluttered room) inherently carries more information and thus deserves more tokens than a simple image (e.g., a blank wall). To address this inefficiency, we propose DOVE, a dynamic vision encoder that produces a variable number of visual tokens (i.e., continuous representation vectors) to reconstruct each image. Our results show that DOVE significantly reduces the average number of tokens while maintaining high reconstruction quality. In several linear probing and downstream multimodal tasks, it outperforms existing autoencoder-based tokenization methods when using far fewer tokens, capturing more expressive semantic features compared to fixed-length encoding. We further extend…
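To make the idea of a per-image token budget concrete, here is a minimal toy sketch. It is not DOVE's method: DOVE learns how many tokens to emit, whereas this example swaps in a simple compression-ratio heuristic as a stand-in for "image complexity". The function names, the `min_tokens` and `max_tokens` bounds, and the two synthetic images are all assumptions made for illustration.

```python
import random
import zlib


def complexity(pixels: bytes) -> float:
    """Crude complexity proxy (an assumption, not DOVE's criterion):
    zlib-compressed size divided by raw size, clipped to [0, 1]."""
    if not pixels:
        return 0.0
    return min(1.0, len(zlib.compress(pixels)) / len(pixels))


def token_budget(pixels: bytes, min_tokens: int = 8, max_tokens: int = 256) -> int:
    """Map the complexity proxy linearly onto a token count
    in [min_tokens, max_tokens]."""
    score = complexity(pixels)
    return min_tokens + round(score * (max_tokens - min_tokens))


# Two hypothetical 64x64 grayscale images stored as raw bytes.
blank_wall = bytes([200]) * (64 * 64)            # uniform, highly compressible
rng = random.Random(0)                           # fixed seed for repeatability
cluttered_room = bytes(rng.randrange(256) for _ in range(64 * 64))  # noise-like

# The uniform image receives a far smaller budget than the noisy one.
print(token_budget(blank_wall), token_budget(cluttered_room))
```

A fixed-length encoder would spend the same number of tokens on both inputs; the sketch only illustrates why a budget that scales with content can lower the average token count.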
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Adversarial Robustness in Machine Learning
MethodsFocus
