TL;DR
This paper introduces a generalized ELBO for multimodal data that removes the trade-off, present in existing self-supervised generative models, between semantic coherence and learning the joint data distribution.
Contribution
A new, generalized ELBO formulation for multimodal data that encompasses two previous methods as special cases and combines their benefits in self-supervised generative learning.
Findings
Shows an advantage over state-of-the-art models on self-supervised generative learning tasks
Encompasses two previous methods as special cases
Combines semantic coherence with learning of the joint data distribution, avoiding the trade-off between the two
Abstract
Multiple data types naturally co-occur when describing real-world phenomena and learning from them is a long-standing goal in machine learning research. However, existing self-supervised generative models approximating an ELBO are not able to fulfill all desired requirements of multimodal models: their posterior approximation functions lead to a trade-off between the semantic coherence and the ability to learn the joint data distribution. We propose a new, generalized ELBO formulation for multimodal data that overcomes these limitations. The new objective encompasses two previous methods as special cases and combines their benefits without compromises. In extensive experiments, we demonstrate the advantage of the proposed method compared to state-of-the-art models in self-supervised, generative learning tasks.
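For readers unfamiliar with the term, the standard evidence lower bound (ELBO) that such models approximate can be written as below. This is a generic sketch and not the paper's generalized objective: the notation is an assumption introduced here, with \mathbb{X} = \{x_1, \dots, x_M\} denoting a set of M modalities, z a shared latent variable, p_\theta the generative model, and q_\phi the approximate joint posterior. The trade-off described in the abstract concerns how q_\phi is constructed from the individual modalities.

```latex
\mathcal{L}(\theta, \phi; \mathbb{X})
  = \mathbb{E}_{q_\phi(z \mid \mathbb{X})}\big[\log p_\theta(\mathbb{X} \mid z)\big]
  - D_{\mathrm{KL}}\big(q_\phi(z \mid \mathbb{X}) \,\|\, p(z)\big)
  \le \log p_\theta(\mathbb{X})
```

The first term rewards accurate reconstruction of all modalities from z, while the KL term keeps the approximate posterior close to the prior p(z).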