Unified Lexical Representation for Interpretable Visual-Language Alignment

Yifan Li; Yikai Wang; Yanwei Fu; Dongyu Ru; Zheng Zhang; Tong He

arXiv:2407.17827·cs.CV·November 12, 2024

TL;DR

LexVLA introduces a unified, interpretable lexical representation framework for visual-language alignment, leveraging pre-trained models and an overuse penalty to improve cross-modal retrieval with simpler training.

Contribution

The paper proposes a lexical representation approach that aligns pre-trained visual and language models without a complex training design, improving both interpretability and retrieval performance.

Findings

01 Outperforms baselines fine-tuned on larger datasets on cross-modal retrieval tasks

02 Uses pre-trained DINOv2 as the visual model and pre-trained Llama 2 on the language side

03 Employs an overuse penalty to reduce false discoveries

Abstract

Visual-Language Alignment (VLA) has gained a lot of attention since CLIP's groundbreaking work. Although CLIP performs well, the typical direct latent feature alignment lacks clarity in its representation and similarity scores. In contrast, a lexical representation, a vector in which each element represents the similarity between the sample and a word from the vocabulary, is naturally sparse and interpretable, providing exact matches for individual words. However, lexical representations are difficult to learn because there is no ground-truth supervision and because of false-discovery issues, and thus require complex design to train effectively. In this paper, we introduce LexVLA, a more interpretable VLA framework that learns a unified lexical representation for both modalities without complex design. We use DINOv2 as our visual model for its local-inclined features and Llama 2, a generative…
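To make the idea of a lexical representation concrete, here is a minimal sketch in Python. It is not the LexVLA implementation: the six-word vocabulary, the scores, and the helper names (`lexical_vector`, `similarity`, `explain`) are all hypothetical and chosen only for illustration. It shows how an image and a caption can each be described by a sparse vector over a shared vocabulary, how a dot product gives their similarity, and how the per-word products show which words account for the match.

```python
# Hypothetical toy vocabulary; a real system would use a tokenizer's
# vocabulary with tens of thousands of entries.
VOCAB = ["dog", "cat", "grass", "ball", "car", "sky"]


def lexical_vector(weights):
    """Build a vocabulary-sized vector from a {word: score} dict.

    Words that are not mentioned keep a score of 0.0, so the vector is sparse.
    """
    vec = [0.0] * len(VOCAB)
    for word, score in weights.items():
        vec[VOCAB.index(word)] = score
    return vec


def similarity(a, b):
    """Dot product of two lexical vectors defined over the same vocabulary."""
    return sum(x * y for x, y in zip(a, b))


def explain(a, b, top_k=3):
    """Return up to top_k (word, contribution) pairs, largest first.

    Each contribution is the product of the two scores for that word,
    so only words active in both vectors appear.
    """
    contrib = [(word, x * y) for word, x, y in zip(VOCAB, a, b) if x * y > 0]
    contrib.sort(key=lambda item: item[1], reverse=True)
    return contrib[:top_k]


# Made-up scores standing in for the outputs of an image and a text encoder.
image_vec = lexical_vector({"dog": 0.9, "grass": 0.6, "ball": 0.3})
caption_vec = lexical_vector({"dog": 1.0, "ball": 0.8})
unrelated_vec = lexical_vector({"car": 1.0, "sky": 0.5})

print(similarity(image_vec, caption_vec))    # positive: "dog" and "ball" overlap
print(explain(image_vec, caption_vec))       # words ranked by contribution
print(similarity(image_vec, unrelated_vec))  # zero: no shared active words
```

Because every dimension corresponds to a word, the similarity score decomposes into word-level terms, which is the sense in which such a representation is interpretable. A dense latent embedding gives no comparable per-dimension reading.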

Code & Models

Repositories

clementine24/lexvla (Official)

Videos

Unified Lexical Representation for Interpretable Visual-Language Alignment · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling

Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training · LLaMA