Compositional Entailment Learning for Hyperbolic Vision-Language Models
Avik Pal, Max van Spengler, Guido Maria D'Amely di Melendugno, Alessandro Flaborea, Fabio Galasso, Pascal Mettes

TL;DR
This paper introduces a novel compositional entailment learning approach for hyperbolic vision-language models, leveraging hierarchical image and text structures to improve representation and generalization in image-text tasks.
Contribution
It proposes a new hierarchical learning method that fully exploits hyperbolic space for vision-language models, enhancing their ability to capture hierarchical relationships.
Findings
Outperforms Euclidean CLIP in zero-shot tasks
Achieves better hierarchical representation and retrieval performance
Enhances generalization in vision-language tasks
Abstract
Image-text representation learning forms a cornerstone in vision-language models, where pairs of images and textual descriptions are contrastively aligned in a shared embedding space. Since visual and textual concepts are naturally hierarchical, recent work has shown that hyperbolic space can serve as a high-potential manifold to learn vision-language representation with strong downstream performance. In this work, for the first time we show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs. We propose Compositional Entailment Learning for hyperbolic vision-language models. The idea is that an image is not only described by a sentence but is itself a composition of multiple object boxes, each with their own textual description. Such information can be obtained freely by extracting nouns from sentences and using…
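As a rough illustration of the contrastive alignment in hyperbolic space that the abstract describes, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the Lorentz (hyperboloid) model with a fixed curvature, an exponential map at the origin to lift Euclidean encoder outputs onto the manifold, and negative geodesic distance as the similarity inside a symmetric InfoNCE loss, a formulation used in prior hyperbolic image-text models such as MERU. All function names, the temperature, and the curvature value are illustrative assumptions, and the compositional entailment terms over image and text boxes are not included.

```python
import numpy as np

def exp_map_origin(v, curv=1.0):
    """Lift Euclidean vectors v of shape (n, d) onto the hyperboloid of
    curvature -curv. Only the space components are returned; the time
    component is recovered by time_component()."""
    sqrt_c = np.sqrt(curv)
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), 1e-8)
    return np.sinh(sqrt_c * norm) * v / (sqrt_c * norm)

def time_component(x, curv=1.0):
    """Time coordinate implied by the constraint <x, x>_L = -1/curv."""
    return np.sqrt(1.0 / curv + np.sum(x * x, axis=-1))

def pairwise_lorentz_distance(x, y, curv=1.0):
    """Geodesic distance between every row of x (n, d) and y (m, d)."""
    # Lorentzian inner product: space dot product minus product of times.
    inner = x @ y.T - np.outer(time_component(x, curv), time_component(y, curv))
    # Clamp to the domain of arccosh to absorb rounding error.
    arg = np.maximum(-curv * inner, 1.0)
    return np.arccosh(arg) / np.sqrt(curv)

def contrastive_loss(img, txt, temperature=0.07, curv=1.0):
    """Symmetric InfoNCE loss; row i of img is paired with row i of txt.
    The temperature value is an assumption chosen for illustration."""
    logits = -pairwise_lorentz_distance(img, txt, curv) / temperature

    def cross_entropy_diag(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

# Usage with random stand-ins for encoder outputs (hypothetical data).
rng = np.random.default_rng(0)
features = 0.5 * rng.standard_normal((4, 8))
img_emb = exp_map_origin(features)
txt_emb = exp_map_origin(features + 0.01 * rng.standard_normal((4, 8)))
loss = contrastive_loss(img_emb, txt_emb)
```

In this sketch, embeddings that lie farther from the origin sit deeper in the hierarchy, which is the property that entailment-style objectives in hyperbolic space build on.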
Peer Reviews
Decision·ICLR 2025 Oral
* The proposed method is simple and elegant and can be easily applied to large-scale pretraining of vision-language models. The procedure to automatically generate paired image and text boxes is also relatively straightforward.
* The empirical results show improvement across several tasks (classification, retrieval, detection, and understanding), which demonstrates improved representation learning.
* Table 1 results show that CLIP trained on additional image-text boxes doesn't improve the perf
If training CLIP on additional image-text boxes shows no improvement (Table 1), this may be because such examples carry little new information, since the original image-text pairs are already present in the training data. An experiment along these lines might help clarify this: split the GRIT dataset into 2 random subsets of 10M each, then compare the results in the following settings: [1] CLIP trained on 10M image-text pairs [2] CLIP trained on 10M image-text pairs + addi
1. This paper is well organized. The motivation is easy to follow, and the method is easy to understand. 2. The proposed HyCoCLIP is novel and effective. It organizes data at multiple abstraction levels, providing an inspiring approach to multi-modal learning. 3. The authors perform exhaustive experiments to show the effectiveness of HyCoCLIP. It outperforms baselines on general and fine-grained image classification tasks.
1. While the paper compares with CLIP and MERU, it should also compare with some recently proposed VLMs. 2. The paper should explore how sensitive the model is to the choice of hyperbolic space parameters.
I think this paper is very well written and I find it easy to follow. Overall, the idea behind HyCoCLIP is well motivated, and I believe the authors have conducted sufficient experiments to empirically demonstrate the efficacy of the proposed method and model. The empirical performance of HyCoCLIP is very strong and, to the best of my knowledge, HyCoCLIP achieves state-of-the-art results for a contrastively pretrained model on many of the reported zero-shot tasks.
One major concern is the incremental nature of this work. Hyperbolic embeddings for representing hierarchical relationships have been explored in previous models, and this paper primarily builds upon these established ideas. However, the specific contributions of HyCoCLIP, particularly in enhancing hierarchical and scene understanding tasks, offer sufficient merit to make this work valuable to the broader community.
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Language-Image Pre-training
