Unified Lexical Representation for Interpretable Visual-Language Alignment
Yifan Li, Yikai Wang, Yanwei Fu, Dongyu Ru, Zheng Zhang, Tong He

TL;DR
LexVLA is a visual-language alignment framework that learns a unified, interpretable lexical representation for images and text. It builds on pre-trained uni-modal models (DINOv2 and Llama 2) and uses an overuse penalty to improve cross-modal retrieval without a complex training setup.
Contribution
The paper proposes a lexical representation approach that aligns a pre-trained visual model and a pre-trained language model without complex training design, improving both interpretability and retrieval performance.
Findings
Outperforms models fine-tuned on larger datasets on cross-modal retrieval tasks
Uses pre-trained DINOv2 (visual) and Llama 2 (language) as the backbones to be aligned
Employs an overuse penalty to reduce false discoveries
Abstract
Visual-Language Alignment (VLA) has gained a lot of attention since CLIP's groundbreaking work. Although CLIP performs well, the typical direct latent feature alignment lacks clarity in its representation and similarity scores. In contrast, a lexical representation, a vector in which each element represents the similarity between the sample and a word from the vocabulary, is naturally sparse and interpretable, providing exact matches for individual words. However, lexical representations are difficult to learn because there is no ground-truth supervision and because of false-discovery issues, and thus require complex designs to train effectively. In this paper, we introduce LexVLA, a more interpretable VLA framework that learns a unified lexical representation for both modalities without complex design. We use DINOv2 as our visual model for its local-inclined features and Llama 2, a generative…
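To make the idea of a lexical representation concrete, here is a minimal sketch in plain Python. It is an illustration only, not the LexVLA implementation: the six-word vocabulary, the hand-written weights, and the helper names (top_words, similarity, word_contributions) are all assumptions made for this example. It shows how a vector indexed by vocabulary words can be read directly as words, and how an image-text similarity score decomposes into per-word matches.

```python
# Toy vocabulary (hypothetical). A real system would use a much larger
# vocabulary, e.g. the one defined by a language model's tokenizer.
VOCAB = ["dog", "cat", "grass", "ball", "car", "street"]


def top_words(lex, k=3):
    """Return up to k (word, weight) pairs with the largest nonzero weights."""
    pairs = [(word, weight) for word, weight in zip(VOCAB, lex) if weight > 0]
    return sorted(pairs, key=lambda p: p[1], reverse=True)[:k]


def similarity(a, b):
    """Dot product of two lexical vectors defined over the same vocabulary."""
    return sum(x * y for x, y in zip(a, b))


def word_contributions(a, b):
    """Per-word terms of the dot product; only words active in both count."""
    return {w: x * y for w, x, y in zip(VOCAB, a, b) if x * y > 0}


# Hand-written weights standing in for model outputs (assumed values).
image_lex = [0.9, 0.0, 0.6, 0.4, 0.0, 0.0]       # photo of a dog with a ball on grass
text_lex = [1.0, 0.0, 0.0, 0.5, 0.0, 0.0]        # caption: "a dog with a ball"
other_text_lex = [0.0, 0.0, 0.0, 0.0, 1.0, 0.8]  # caption: "a car on the street"

if __name__ == "__main__":
    print(top_words(image_lex))
    print(similarity(image_lex, text_lex))
    print(word_contributions(image_lex, text_lex))
    print(similarity(image_lex, other_text_lex))
```

Because both modalities share one vocabulary, the matching score is explained by the words that are active in both vectors ("dog" and "ball" in this toy case), while the unrelated caption shares no active words and scores zero. A latent embedding, by contrast, offers no such word-level reading of its dimensions.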
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling
Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training · LLaMA
