Emergent Visual-Semantic Hierarchies in Image-Text Representations

Morris Alper; Hadar Averbuch-Elor

arXiv:2407.08521·cs.CV·July 17, 2024

TL;DR

This paper shows that existing vision-and-language foundation models exhibit an emergent understanding of visual-semantic hierarchies despite not being trained for it directly, and introduces a framework and a benchmark to probe and improve this hierarchical knowledge.

Contribution

The study uncovers emergent hierarchical understanding in foundation models and proposes the Radial Embedding framework and HierarCaps dataset for probing and improving this capability.

Findings

01 Foundation models exhibit zero-shot hierarchical understanding.

02 HierarCaps dataset enables benchmarking of hierarchical knowledge.

03 Fine-tuning improves hierarchical reasoning without losing pretraining knowledge.

Abstract

While recent vision-and-language models (VLMs) like CLIP are a powerful tool for analyzing text and images in a shared semantic space, they do not explicitly model the hierarchical nature of the set of texts which may describe an image. Conversely, existing multimodal hierarchical representation learning methods require costly training from scratch, failing to leverage the knowledge encoded by state-of-the-art multimodal foundation models. In this work, we study the knowledge of existing foundation models, finding that they exhibit emergent understanding of visual-semantic hierarchies despite not being directly trained for this purpose. We propose the Radial Embedding (RE) framework for probing and optimizing hierarchical understanding, and contribute the HierarCaps dataset, a benchmark facilitating the study of hierarchical knowledge in image-text representations, constructed…

Code & Models

Repositories

TAU-VAILab/hierarcaps (PyTorch, official)

Taxonomy

Topics: Image Retrieval and Classification Techniques

Methods: Contrastive Language-Image Pre-training