Visual Lexicon: Rich Image Features in Language Space

XuDong Wang; Xingyi Zhou; Alireza Fathi; Trevor Darrell; Cordelia Schmid

arXiv:2412.06774 · cs.CV · December 10, 2024

TL;DR

Visual Lexicon (ViLex) is a visual language that encodes rich image details into the text space. It enables high-fidelity image reconstruction with a frozen text-to-image model and improves vision-language tasks, bridging visual and linguistic representations.

Contribution

The paper presents ViLex, a self-supervised method that encodes detailed visual information into language tokens, improving image reconstruction and vision-language model performance.

Findings

01

Achieves higher fidelity in image reconstruction with a single token.

02

Enables zero-shot DreamBooth tasks without fine-tuning.

03

Improves performance across 15 vision-language benchmarks.

Abstract

We present Visual Lexicon, a novel visual language that encodes rich image information into the text space of vocabulary tokens while retaining intricate visual details that are often challenging to convey in natural language. Unlike traditional methods that prioritize either high-level semantics (e.g., CLIP) or pixel-level reconstruction (e.g., VAE), ViLex simultaneously captures rich semantic content and fine visual details, enabling high-quality image generation and comprehensive visual scene understanding. Through a self-supervised learning pipeline, ViLex generates tokens optimized for reconstructing input images using a frozen text-to-image (T2I) diffusion model, preserving the detailed information necessary for high-fidelity semantic-level reconstruction. As an image embedding in the language space, ViLex tokens leverage the compositionality of natural languages, allowing them to…
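The self-supervised idea described in the abstract (train an image encoder so that its output tokens let a frozen generator reconstruct the input) can be illustrated with a toy sketch. This is not the authors' implementation: linear maps stand in for both the image encoder and the frozen T2I diffusion model, a plain mean squared error replaces the diffusion denoising objective, and every name and dimension below (`W_frozen`, `W_enc`, `IMG_DIM`, `TOK_DIM`, `train`) is a hypothetical choice made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

IMG_DIM = 16  # flattened toy "image" size (assumption)
TOK_DIM = 8   # width of the toy text-embedding space (assumption)

# Frozen stand-in for the T2I model: maps one token embedding to an image.
# It is never updated during training.
W_frozen = rng.normal(size=(IMG_DIM, TOK_DIM)) / np.sqrt(TOK_DIM)


def reconstruct(images, W_enc):
    """Encode each image into a single token, then decode with the frozen map."""
    tokens = images @ W_enc.T      # shape (N, TOK_DIM)
    return tokens @ W_frozen.T     # shape (N, IMG_DIM)


def loss(images, W_enc):
    """Mean squared reconstruction error (toy substitute for a denoising loss)."""
    return float(np.mean((reconstruct(images, W_enc) - images) ** 2))


def train(images, W_enc, lr=0.05, steps=200):
    """Gradient descent on the encoder only; the decoder stays frozen."""
    for _ in range(steps):
        err = reconstruct(images, W_enc) - images          # shape (N, IMG_DIM)
        # Gradient of the mean squared error with respect to W_enc.
        grad = 2.0 * (W_frozen.T @ err.T @ images) / err.size
        W_enc = W_enc - lr * grad
    return W_enc


images = rng.normal(size=(64, IMG_DIM))       # random toy "images"
W_enc_init = np.zeros((TOK_DIM, IMG_DIM))     # untrained encoder
W_frozen_before = W_frozen.copy()

W_enc_trained = train(images, W_enc_init)

print("loss before:", loss(images, W_enc_init))
print("loss after: ", loss(images, W_enc_trained))
```

Because only `W_enc` receives updates, whatever the encoder learns must be expressed in the input space the frozen decoder already accepts. That is the property the abstract relies on when it describes ViLex tokens as living in the text space of a frozen T2I model.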

Taxonomy

Topics: Lexicography and Language Studies

Methods: Diffusion