Visual Lexicon: Rich Image Features in Language Space
XuDong Wang, Xingyi Zhou, Alireza Fathi, Trevor Darrell, Cordelia Schmid

TL;DR
Visual Lexicon (ViLex) is a visual language that encodes rich image details into text space. It enables high-fidelity image reconstruction and improves vision-language tasks without fine-tuning, bridging visual and linguistic representations.
Contribution
The paper presents ViLex, a self-supervised method that encodes detailed visual information into language tokens, improving both image reconstruction and vision-language model performance.
Findings
Achieves higher image-reconstruction fidelity than text embeddings, even with a single ViLex token.
Enables zero-shot DreamBooth tasks without fine-tuning.
Improves performance across 15 vision-language benchmarks.
Abstract
We present Visual Lexicon, a novel visual language that encodes rich image information into the text space of vocabulary tokens while retaining intricate visual details that are often challenging to convey in natural language. Unlike traditional methods that prioritize either high-level semantics (e.g., CLIP) or pixel-level reconstruction (e.g., VAE), ViLex simultaneously captures rich semantic content and fine visual details, enabling high-quality image generation and comprehensive visual scene understanding. Through a self-supervised learning pipeline, ViLex generates tokens optimized for reconstructing input images using a frozen text-to-image (T2I) diffusion model, preserving the detailed information necessary for high-fidelity semantic-level reconstruction. As an image embedding in the language space, ViLex tokens leverage the compositionality of natural languages, allowing them to…
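The training idea in the abstract (an encoder whose tokens are optimized so that a frozen T2I diffusion model can reconstruct the input image) can be illustrated with a minimal toy sketch. Everything here is an assumption made for illustration and is not the paper's architecture: the "images" are 4-dimensional vectors, the encoder is a single linear map `W`, and the frozen diffusion model is replaced by a hypothetical linear denoiser built on a fixed matrix `A`. Only `W` receives gradient updates, which is the one property the sketch is meant to show.

```python
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Hypothetical frozen stand-in for the T2I denoiser. It assumes the clean
# image equals A @ tokens and is never updated during training.
A = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]

def frozen_denoiser(x_noisy, tokens, sigma):
    """Predict the noise in x_noisy, conditioned on the tokens."""
    x0_hat = matvec(A, tokens)
    return [(xn - x0) / sigma for xn, x0 in zip(x_noisy, x0_hat)]

def denoising_loss(W, x, eps, sigma):
    """Squared error between the true noise and the predicted noise."""
    tokens = matvec(W, x)  # trainable encoder: image -> tokens
    x_noisy = [xi + sigma * e for xi, e in zip(x, eps)]
    eps_hat = frozen_denoiser(x_noisy, tokens, sigma)
    return sum((eh - e) ** 2 for eh, e in zip(eps_hat, eps))

def encoder_grad(W, x, sigma):
    """Analytic gradient of the loss with respect to W only.

    In this linear toy the loss reduces to ||x - A W x||^2 / sigma^2,
    so the gradient is -2 A^T r x^T / sigma^2 with r = x - A W x.
    The noise cancels here; that is a property of the toy, not of
    a real diffusion model.
    """
    r = [xi - yi for xi, yi in zip(x, matvec(A, matvec(W, x)))]
    At_r = [sum(A[d][k] * r[d] for d in range(len(A))) for k in range(len(W))]
    return [[-2.0 * g * xj / sigma ** 2 for xj in x] for g in At_r]

def mean_loss(W, images, sigma, rng):
    """Average denoising loss over the images with freshly sampled noise."""
    total = 0.0
    for x in images:
        eps = [rng.gauss(0.0, 1.0) for _ in x]
        total += denoising_loss(W, x, eps, sigma)
    return total / len(images)

def train_encoder(images, n_tokens=2, steps=200, lr=0.05, sigma=1.0):
    """Per-sample gradient descent that updates the encoder and nothing else."""
    W = [[0.0] * len(images[0]) for _ in range(n_tokens)]
    for _ in range(steps):
        for x in images:
            grad = encoder_grad(W, x, sigma)
            W = [[w - lr * g for w, g in zip(w_row, g_row)]
                 for w_row, g_row in zip(W, grad)]
    return W

# Toy "images" chosen to lie in the span of A's columns, so the frozen
# model is able to reconstruct them from two token values.
images = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
rng = random.Random(0)
W0 = [[0.0] * 4 for _ in range(2)]
loss_before = mean_loss(W0, images, 1.0, rng)
W = train_encoder(images)
loss_after = mean_loss(W, images, 1.0, rng)
print("loss before:", loss_before, "loss after:", loss_after)
```

In a realistic setting the linear maps would be replaced by an image encoder and a pretrained diffusion model with its weights held fixed, and the gradient would come from automatic differentiation rather than a hand-derived formula.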
Taxonomy
Topics: Lexicography and Language Studies
Methods: Diffusion
