Causal Graphical Models for Vision-Language Compositional Understanding
Fiorenzo Parascandolo, Nicholas Moratelli, Enver Sangineto, Lorenzo Baraldi, Rita Cucchiara

TL;DR
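This paper models the dependency relations among textual and visual tokens with a Causal Graphical Model (CGM) built from a dependency parser, and trains a decoder, conditioned on the VLM visual encoder, whose generation order follows the CGM structure. The approach improves vision-language models' compositional understanding on five benchmarks.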
Contribution
The paper proposes a novel causal graphical model framework that captures dependency relations in vision-language tasks, outperforming existing methods on multiple benchmarks.
Findings
Significant performance improvements on five compositional benchmarks.
Outperforms state-of-the-art approaches by a large margin.
Effective in learning causal dependencies, reducing spurious correlations.
Abstract
Recent work has empirically shown that Vision-Language Models (VLMs) struggle to fully understand the compositional properties of the human language, usually modeling an image caption as a "bag of words". As a result, they perform poorly on compositional tasks, which require a deeper understanding of the different entities of a sentence (subject, verb, etc.) jointly with their mutual relationships in order to be solved. In this paper, we model the dependency relations among textual and visual tokens using a Causal Graphical Model (CGM), built using a dependency parser, and we train a decoder conditioned by the VLM visual encoder. Differently from standard autoregressive or parallel predictions, our decoder's generative process is partially-ordered following the CGM structure. This structure encourages the decoder to learn only the main causal dependencies in a sentence discarding…
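The partially-ordered generation described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the parse below is hand-written rather than produced by a dependency parser, the names `PARSE` and `generation_levels` are hypothetical, and the edge direction (each token depends on its syntactic head) is an assumption of this example that the abstract does not state. Under those assumptions, tokens in the same level have no ordering constraint between them, which is what distinguishes a partial order from a strict left-to-right autoregressive order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical hand-written dependency parse of "a dog chases a red ball".
# Each entry is (token, index of its head); the root has head -1.
PARSE = [
    ("a", 1),
    ("dog", 2),
    ("chases", -1),
    ("a", 5),
    ("red", 5),
    ("ball", 2),
]


def generation_levels(parse):
    """Group token indices into levels so that a token comes after its head."""
    # TopologicalSorter expects a mapping {node: set of predecessors}.
    # Assumption of this sketch: the only predecessor of a token is its head.
    graph = {
        i: ({head} if head >= 0 else set())
        for i, (_, head) in enumerate(parse)
    }
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    levels = []
    while sorter.is_active():
        # Tokens whose heads have all been emitted already.
        ready = sorted(sorter.get_ready())
        levels.append(ready)
        sorter.done(*ready)
    return levels


levels = generation_levels(PARSE)
for depth, indices in enumerate(levels):
    print(depth, [PARSE[i][0] for i in indices])
```

Here the root verb forms the first level, its two arguments the second, and their modifiers the third. A decoder following such a structure would condition each token on its ancestors rather than on every preceding word.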
Peer Reviews
Decision · ICLR 2025 Poster
1. The authors propose a causal graphical model with a dependency-constrained decoding method to incorporate compositional knowledge into VLMs. Using relationship representations could effectively model the compositionality of the data and resolve the long-standing issue that sequential-modeling VLMs treat data as a bag of words. 2. Experiments on several vision-language compositional benchmarks show that the proposed method is effective in improving compositional performance. Specific
1. The training objective might not be robust to mislabeled data, which poses the challenge of strict data cleaning when future work considers scaling up. 2. The authors did not compare the training or inference efficiency of the proposed method with other baselines. 3. The evaluation focuses on retrieval, where the model is tasked to find the correct caption for a given image and vice versa. It remains unclear how to improve the compositional generation ability, which is of a greater n
- This paper is easy to read, with extensive empirical results and useful qualitative examples. - Motivation: using dependency/syntactic trees to capture the grammatical structure of sentences is well motivated -- this hierarchical structure shows how words depend on one another, which may also help to capture the compositional meaning of sentences. In vision-language models, capturing compositional meaning affects how different parts of a sentence relate to one another. - This paper show
- Generalization: One downside of this work is the overreliance on the CLIP visual encoder as the feature extractor -- recent work has shown stronger performance using stronger baselines (e.g., average on SugarCrepe Swap -> CLIP acc. 64.25% vs. LLaVA acc. 81.25% -- using Contrastive Region Guidance [2] LLaVA acc. 90.75% -- the result reported using CLIP+CGMs is acc. 83.14%). While X-VLM focuses on fine-grained alignment via cross-attention layers (for interactions between image and text features at mul
- small training, small decoder, great results - seems to be useful for CR checking, although not tested; can potentially aid in slow-fast type approaches to retrieval tasks (akin to how the original BLIP v1 did it) - might be a good idea to test, would make the contribution stronger I think - contains adequate ablation
- no comparison to decoder LMMs (all the LLM-alignment llava-style methods) - those afaik perform strongly on those old CR benchmarks, especially with CapPa style inference, but multiple choice generation (answer with single letter, etc.) also works well - could test with a CLIP-blind benchmark like eyes-wide-shut - no applications beyond CR - e.g. to improve retrieval using a slow-fast approach, where CLIP retrieves first and the proposed method (being early fusion it is heavier) filters - It is no
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies
