Compositional Entailment Learning for Hyperbolic Vision-Language Models

Avik Pal; Max van Spengler; Guido Maria D'Amely di Melendugno; Alessandro Flaborea; Fabio Galasso; Pascal Mettes

arXiv:2410.06912·cs.CV·March 4, 2025



TL;DR

This paper introduces a novel compositional entailment learning approach for hyperbolic vision-language models, leveraging hierarchical image and text structures to improve representation and generalization in image-text tasks.

Contribution

It proposes a new hierarchical learning method that fully exploits hyperbolic space for vision-language models, enhancing their ability to capture hierarchical relationships.

Findings

01. Outperforms Euclidean CLIP in zero-shot tasks

02. Achieves better hierarchical representation and retrieval performance

03. Enhances generalization in vision-language tasks

Abstract

Image-text representation learning forms a cornerstone in vision-language models, where pairs of images and textual descriptions are contrastively aligned in a shared embedding space. Since visual and textual concepts are naturally hierarchical, recent work has shown that hyperbolic space can serve as a high-potential manifold to learn vision-language representation with strong downstream performance. In this work, for the first time we show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs. We propose Compositional Entailment Learning for hyperbolic vision-language models. The idea is that an image is not only described by a sentence but is itself a composition of multiple object boxes, each with their own textual description. Such information can be obtained freely by extracting nouns from sentences and using…
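The abstract's premise, aligning images and text in hyperbolic rather than Euclidean space, rests on measuring geodesic distance on a curved manifold. The following is a minimal sketch of that distance, assuming the Lorentz (hyperboloid) model with curvature -c; the abstract does not say which hyperbolic model or parameterization the paper uses, and the function names here are hypothetical rather than taken from the authors' repository.

```python
import math


def exp_map_origin(v, curv=1.0):
    """Map a Euclidean vector v (a tangent vector at the hyperboloid origin)
    onto the hyperboloid of curvature -curv. Only the space components are
    returned; the time component is recovered by time_component()."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scaled = math.sqrt(curv) * norm
    factor = math.sinh(scaled) / scaled
    return [factor * x for x in v]


def time_component(x_space, curv=1.0):
    """Time coordinate implied by the constraint <x, x>_L = -1/curv."""
    return math.sqrt(1.0 / curv + sum(x * x for x in x_space))


def lorentz_distance(x_space, y_space, curv=1.0):
    """Geodesic distance between two points given by their space components."""
    # Lorentzian inner product: space dot product minus product of time parts.
    inner = sum(a * b for a, b in zip(x_space, y_space))
    inner -= time_component(x_space, curv) * time_component(y_space, curv)
    # Clamp to the domain of acosh to absorb floating-point rounding.
    arg = max(1.0, -curv * inner)
    return math.acosh(arg) / math.sqrt(curv)


if __name__ == "__main__":
    origin = [0.0, 0.0]
    point = exp_map_origin([0.3, 0.4])
    print(lorentz_distance(origin, point))
```

Under these assumptions, the distance from the origin to a mapped point equals the Euclidean norm of the original vector, which is one way to read the hierarchy intuition: more generic concepts can sit near the origin and more specific ones farther out.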

Peer Reviews

Decision·ICLR 2025 Oral

Reviewer 01 · Rating 8 · Confidence 3

Strengths

* The proposed method is simple and elegant and can be easily applied to large-scale pretraining of vision-language models. The procedure to automatically generate paired image and text boxes is also relatively straightforward.
* The empirical results show improvement across several tasks which demonstrates the improved representation learning - classification, retrieval, detection and understanding.
* Table 1 results show that CLIP trained on additional image-text boxes doesn't improve the perf

Weaknesses

That training CLIP on additional image-text boxes shows no improvement (Table 1) could be because there is limited new information in such examples (the original image-text pairs are already present in the training data). For a better understanding of this, an experiment such as this might help: split the GRIT dataset into 2 random subsets of 10M each. Then compare the results on the following settings: [1] CLIP trained on 10M image-text pairs [2] CLIP trained on 10M image-text pairs + addi

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. This paper is well-organized. The motivation is easy to follow, and the method is easy to understand.
2. The proposed HyCoCLIP is novel and effective. It organizes data at multiple abstraction levels, providing an inspiring approach to multi-modal learning.
3. The authors perform exhaustive experiments to show the effectiveness of HyCoCLIP. It outperforms baselines on general and fine-grained image classification tasks.

Weaknesses

1. While the paper compares with CLIP and MERU, it should also compare with some recently proposed VLMs.
2. The paper should explore how sensitive the model is to the choice of hyperbolic space parameters.

Reviewer 03 · Rating 8 · Confidence 3

Strengths

I think this paper is very well written and I find it easy to follow. Overall, the idea behind HyCoCLIP is well motivated, and I believe the authors have conducted sufficient experiments to empirically demonstrate the efficacy of the proposed method and model. The empirical performance of HyCoCLIP is very strong and, to the best of my knowledge, the proposed HyCoCLIP achieves state-of-the-art results for a contrastively pretrained model on many of the reported zero-shot tasks.

Weaknesses

One major concern is the incremental nature of this work. Hyperbolic embeddings for representing hierarchical relationships have been explored in previous models, and this paper primarily builds upon these established ideas. However, the specific contributions of HyCoCLIP, particularly in enhancing hierarchical and scene understanding tasks, offer sufficient merit to make this work valuable to the broader community.

Code & Models

Repositories

PalAvik/hycoclip · PyTorch · Official


Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Language-Image Pre-training