Independent Density Estimation
Jiahao Liu, Senhao Cao

TL;DR
This paper introduces Independent Density Estimation (IDE), an approach to improving compositional generalization in vision-language models by learning the connection between individual words and the corresponding image features. The approach is demonstrated through two models and an entropy-based compositional inference method.
Contribution
The paper proposes IDE, a new method for enhancing compositional generalization in vision-language models, with two models utilizing disentangled features and a novel inference technique.
Findings
Models outperform existing methods on unseen compositions
Disentangled representations improve generalization
Entropy-based inference effectively combines word predictions
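As a rough illustration of the last finding, the sketch below shows one way per-word predictions could be combined using entropy. This summary does not give the paper's actual combination rule, so everything here is an assumption: the function name combine_word_predictions, the input layout (one probability row per word over a shared set of candidate images), and the weighting scheme (one minus normalized entropy, applied to per-word log-probabilities) are all hypothetical.

```python
import numpy as np


def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))


def combine_word_predictions(word_probs, eps=1e-12):
    """Combine per-word distributions over candidates into one score per candidate.

    word_probs: array-like of shape (n_words, n_candidates); each row is the
    distribution one word's predictor assigns to the candidates.

    Assumed rule (not taken from the paper): a word whose distribution has low
    entropy is treated as confident and gets a larger weight.
    """
    word_probs = np.asarray(word_probs, dtype=float)
    n_candidates = word_probs.shape[1]
    if n_candidates < 2:
        raise ValueError("need at least two candidates to compare")

    max_entropy = np.log(n_candidates)  # entropy of a uniform row
    entropies = np.array([entropy(row, eps) for row in word_probs])

    # Weight is 1 for a one-hot row and 0 for a uniform row.
    weights = np.clip(1.0 - entropies / max_entropy, 0.0, 1.0)

    # Weighted sum of per-word log-probabilities for each candidate.
    scores = weights @ np.log(word_probs + eps)
    return scores, weights


# Hypothetical example: "red" is confident about candidate 0, "cube" is uninformative.
scores, weights = combine_word_predictions([
    [0.90, 0.05, 0.05],   # word "red"
    [1 / 3, 1 / 3, 1 / 3],  # word "cube"
])
best_candidate = int(np.argmax(scores))
```

In this toy case the uninformative word contributes almost nothing, so the confident word decides the ranking. Other weightings (for example, a softmax over negative entropies) would fit the same description equally well.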
Abstract
Large-scale Vision-Language models have achieved remarkable results in various domains, such as image captioning and conditioned image generation. Nevertheless, these models still encounter difficulties in achieving human-like compositional generalization. In this study, we propose a new method called Independent Density Estimation (IDE) to tackle this challenge. IDE aims to learn the connection between individual words in a sentence and the corresponding features in an image, enabling compositional generalization. We build two models based on the philosophy of IDE. The first one utilizes fully disentangled visual representations as input, and the second leverages a Variational Auto-Encoder to obtain partially disentangled features from raw images. Additionally, we propose an entropy-based compositional inference method to combine predictions of each word in the sentence. Our models…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
