Seeing the Abstract: Translating the Abstract Language for Vision Language Models
Davide Talon, Federico Girella, Ziyue Liu, Marco Cristani, Yiming Wang

TL;DR
This paper shows that abstract language is widespread in vision-language data, especially in the fashion domain, and introduces a training-free method that helps vision-language models (VLMs) handle abstract concepts, improving retrieval performance.
Contribution
It reveals the importance of abstract language in VLMs and proposes ACT, a novel, training-free, model-agnostic method to better represent abstract concepts in the latent space.
Findings
Abstract terms are prevalent and valuable in fashion VLM datasets.
Current VLMs lack sufficient abstract language understanding due to training data limitations.
ACT improves retrieval performance across various models without additional training.
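To make "retrieval performance" concrete, the following is a minimal sketch of how text-to-image retrieval is commonly scored with Recall@K over embedding similarities. It is a generic illustration, not the paper's evaluation protocol or the ACT method: the function name `recall_at_k`, the toy embeddings, and the convention that text `i` matches image `i` are all assumptions made for this example.

```python
import numpy as np

def recall_at_k(text_emb, image_emb, k=1):
    """Fraction of text queries whose matching image ranks in the top k.

    Assumption for this sketch: text i is paired with image i.
    """
    # L2-normalise rows so the dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sims = t @ v.T  # shape: (num_texts, num_images)
    # Indices of the k most similar images for each text query.
    topk = np.argsort(-sims, axis=1)[:, :k]
    targets = np.arange(len(t))[:, None]
    return float((topk == targets).any(axis=1).mean())

# Hypothetical toy embeddings (2-D for readability), not real VLM outputs.
text_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
image_emb = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.7]])

r1 = recall_at_k(text_emb, image_emb, k=1)
r2 = recall_at_k(text_emb, image_emb, k=2)
```

In a real setting the embeddings would come from a VLM's text and image encoders. A training-free method would change only how the text embedding is formed before this scoring step, leaving the model weights untouched.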
Abstract
Natural language goes beyond dryly describing visual content. It contains rich abstract concepts to express feeling, creativity and properties that cannot be directly perceived. Yet, current research in Vision Language Models (VLMs) has not shed light on abstract-oriented language. Our research breaks new ground by uncovering its wide presence and under-estimated value, with extensive analysis. Particularly, we focus our investigation on the fashion domain, a highly-representative field with abstract expressions. By analyzing recent large-scale multimodal fashion datasets, we find that abstract terms have a dominant presence, rivaling the concrete ones, providing novel information, and being useful in the retrieval task. However, a critical challenge emerges: current general-purpose or fashion-specific VLMs are pre-trained with databases that lack sufficient abstract words in their text…
Taxonomy
Topics: Semantic Web and Ontologies · Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: Focus
