Explaining, Verifying, and Aligning Semantic Hierarchies in Vision-Language Model Embeddings
Gesina Schwalbe, Mert Keser, Moritz Bayerkuhnlein, Edgar Heinert, Annika Mütze, Marvin Keller, Sparsh Tiwari, Georgii Mikriukov, Diedrich Wolter, Jae Hee Lee, Matthias Rottmann

TL;DR
This paper introduces a framework for analyzing and aligning the semantic hierarchies in vision-language model embeddings. It reveals differences between the image and text modalities and a trade-off between zero-shot accuracy and ontological plausibility.
Contribution
The paper presents a post-hoc method that extracts semantic hierarchies from VLM embeddings by clustering, verifies them by comparison against human ontologies, and aligns them through embedding-space transformations.
Findings
Image encoders are more discriminative than text encoders.
Text encoders produce hierarchies that better match human taxonomies.
There is a trade-off between zero-shot accuracy and ontological plausibility.
Abstract
Vision-language model (VLM) encoders such as CLIP enable strong retrieval and zero-shot classification in a shared image-text embedding space, yet the semantic organization of this space is rarely inspected. We present a post-hoc framework to explain, verify, and align the semantic hierarchies induced by a VLM over a given set of child classes. First, we extract a binary hierarchy by agglomerative clustering of class centroids and name internal nodes by dictionary-based matching to a concept bank. Second, we quantify plausibility by comparing the extracted tree against human ontologies using efficient tree- and edge-level consistency measures, and we evaluate utility via explainable hierarchical tree-traversal inference with uncertainty-aware early stopping (UAES). Third, we propose an ontology-guided post-hoc alignment method that learns a lightweight embedding-space transformation,…
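The extraction step described in the abstract (agglomerative clustering of class centroids, then naming internal nodes by matching against a concept bank) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the linkage criterion (merging the two clusters whose mean embeddings have the highest cosine similarity), the naming rule (nearest concept to the mean of a node's leaf centroids), the function names, and the three-dimensional toy vectors are all assumptions made for the example. In practice the centroids and concept vectors would come from a VLM encoder such as CLIP.

```python
import math
from itertools import combinations

def normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def mean(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def extract_hierarchy(centroids):
    """Greedy agglomerative clustering of class centroids into a binary tree.

    centroids: dict mapping class name -> embedding vector.
    Returns a nested tuple, e.g. (("cat", "dog"), ("car", "truck")).
    Linkage rule (an assumption): merge the pair of clusters whose
    mean embeddings are most similar under cosine similarity.
    """
    # Each cluster is (subtree, list of member vectors).
    clusters = [(name, [normalize(vec)]) for name, vec in centroids.items()]
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cosine(mean(clusters[p[0]][1]), mean(clusters[p[1]][1])),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

def leaves(tree):
    """List the leaf class names under a (sub)tree."""
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])

def name_nodes(tree, centroids, concept_bank):
    """Label each internal node with the closest concept from the bank.

    Returns leaves unchanged and internal nodes as (label, left, right).
    """
    if isinstance(tree, str):
        return tree
    node_vec = mean([normalize(centroids[leaf]) for leaf in leaves(tree)])
    label = max(concept_bank, key=lambda c: cosine(node_vec, concept_bank[c]))
    return (
        label,
        name_nodes(tree[0], centroids, concept_bank),
        name_nodes(tree[1], centroids, concept_bank),
    )

# Hypothetical toy data standing in for real VLM embeddings.
toy_centroids = {
    "cat": [1.0, 0.1, 0.0],
    "dog": [0.9, 0.2, 0.0],
    "car": [0.0, 0.1, 1.0],
    "truck": [0.1, 0.0, 0.9],
}
toy_concepts = {
    "animal": [1.0, 0.0, 0.0],
    "vehicle": [0.0, 0.0, 1.0],
    "entity": [1.0, 0.0, 1.0],
}

toy_tree = extract_hierarchy(toy_centroids)
toy_named = name_nodes(toy_tree, toy_centroids, toy_concepts)
```

On the toy data, the two animal classes and the two vehicle classes are merged first, and the named tree can then be compared node by node against a reference ontology, which is the kind of plausibility check the abstract describes.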
