TL;DR
This paper presents a framework for creating multimodal datasets with known mutual information, enabling systematic evaluation of MI estimators and advancing multimodal self-supervised learning research.
Contribution
The authors introduce a novel framework for generating realistic multimodal datasets with explicitly calculable mutual information using flow-based models and causal structures.
Findings
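A suite of MI estimators is benchmarked on datasets with varying ground-truth MI; regression performance improves as the MI between the input modalities and the target increases.
The framework is applicable to multi-detector astrophysics and to SSL studies in the highly multimodal regime.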
Provides a new testbed for studying mutual information in multimodal data.
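The link between MI and regression performance can be illustrated with a toy case that is not taken from the paper. Assume a scalar input x and target y that are jointly Gaussian with correlation rho. Their MI is I = -0.5 * log(1 - rho^2) nats, so the best achievable R^2 is rho^2 = 1 - exp(-2 I), which rises monotonically with I. The helper names `rho_for_mi` and `r2_for_mi` below are hypothetical, and the sketch only checks this Gaussian relation numerically.

```python
import numpy as np

def rho_for_mi(mi_nats):
    """Correlation of a bivariate Gaussian whose MI equals mi_nats (in nats)."""
    return np.sqrt(1.0 - np.exp(-2.0 * mi_nats))

def r2_for_mi(mi_nats, n=50_000, seed=0):
    """Sample (x, y) with the requested MI and return the R^2 of a linear fit."""
    rng = np.random.default_rng(seed)
    rho = rho_for_mi(mi_nats)
    x = rng.standard_normal(n)
    # y has unit variance and correlation rho with x (toy assumption).
    y = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
    slope, intercept = np.polyfit(x, y, deg=1)
    residual = y - (slope * x + intercept)
    return 1.0 - residual.var() / y.var()

if __name__ == "__main__":
    for mi in (0.1, 0.5, 1.0, 2.0):
        print(f"MI = {mi:.1f} nats  R^2 = {r2_for_mi(mi):.3f}  "
              f"theory = {1.0 - np.exp(-2.0 * mi):.3f}")
```

For high-dimensional modalities the relation between MI and achievable regression error is not this simple, so the sketch should be read as intuition only.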
Abstract
We introduce a framework for generating highly multimodal datasets with explicitly calculable mutual information (MI) between modalities. This enables the construction of benchmark datasets that provide a novel testbed for systematic studies of mutual information estimators and multimodal self-supervised learning (SSL) techniques. Our framework constructs realistic datasets with known MI using a flow-based generative model and a structured causal framework for generating correlated latent variables. We benchmark a suite of MI estimators on datasets with varying ground truth MI values and verify that regression performance improves as the MI increases between input modalities and the target value. Finally, we describe how our framework can be applied to contexts including multi-detector astrophysics and SSL studies in the highly multimodal regime.
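The core idea of pairing latents of known MI with an invertible generator can be sketched in one dimension. MI is unchanged when each variable is passed through its own invertible map, so the MI fixed at the latent level carries over to the generated data. In the sketch below, the maps `sinh(z)` and `z**3 + z` are hypothetical stand-ins for trained flows, the latents are assumed to be correlated Gaussians, and `rank_mi_estimate` is an illustrative Gaussian-copula estimator, not one of the estimators benchmarked in the paper.

```python
import numpy as np

def rho_for_mi(mi_nats):
    """Correlation of a bivariate Gaussian whose MI equals mi_nats (in nats)."""
    return np.sqrt(1.0 - np.exp(-2.0 * mi_nats))

def rank_mi_estimate(a, b):
    """MI estimate (nats) from Spearman's rank correlation, assuming a Gaussian copula."""
    ranks_a = np.argsort(np.argsort(a))
    ranks_b = np.argsort(np.argsort(b))
    spearman = np.corrcoef(ranks_a, ranks_b)[0, 1]
    # For a bivariate Gaussian, Pearson rho = 2 * sin(pi * Spearman / 6).
    pearson = 2.0 * np.sin(np.pi * spearman / 6.0)
    return -0.5 * np.log(1.0 - pearson**2)

def demo(mi_true=1.0, n=200_000, seed=0):
    """Return MI estimates for the latents and for the transformed 'observations'."""
    rng = np.random.default_rng(seed)
    rho = rho_for_mi(mi_true)
    z1 = rng.standard_normal(n)
    z2 = rho * z1 + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
    # Hypothetical stand-ins for flows: strictly increasing, hence invertible.
    x1 = np.sinh(z1)
    x2 = z2**3 + z2
    return rank_mi_estimate(z1, z2), rank_mi_estimate(x1, x2)

if __name__ == "__main__":
    latent_mi, observed_mi = demo()
    print(f"latent estimate: {latent_mi:.3f} nats, observed estimate: {observed_mi:.3f} nats")
```

Because the estimator depends only on ranks, the two estimates agree by construction; the informative part is that both land near the analytic value chosen for the latents, even though `x1` and `x2` are far from Gaussian.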
Peer Reviews
Decision·Submitted to ICLR 2026
Although data generation is not my primary area of expertise, this work appears to address a genuinely underexplored and important problem: constructing realistic high-dimensional multimodal datasets with analytically tractable and controllable mutual information, which could enable systematic evaluation of self-supervised learning methods and mutual information estimators. The theoretical development is simple and clear. The use of flow-based generative models to maintain information structure
The main limitation of this paper lies in the absence of empirical validation. While the framework is theoretically elegant, the paper does not demonstrate that the generated datasets are practically useful for their intended purposes, such as evaluating self-supervised learning methods or mutual information estimators. The examples provided are purely illustrative and rely on analytic expressions rather than experiments that confirm controllability or MI preservation in practice. Moreover, the
1. The paper proposes a framework for generating high-dimensional multimodal data with controllable mutual information, which is rarely achieved in existing public datasets or prior methods.
2. By leveraging flow-based generative models, the approach ensures that the generated data preserves mutual information between latent variables, providing a theoretical foundation.
1. All experiments are conducted solely on CIFAR-10 image data, without demonstrating results on real multimodal datasets (e.g., CMU-MOSI, CMU-MOSEI, or video-text-audio combinations).
2. The paper does not evaluate the generated data on downstream tasks (e.g., regression or classification), making it difficult to quantitatively assess its contribution. It also lacks direct comparison with existing mutual information estimators or multimodal SSL approaches.
3. Some concepts (e.g., the template
- The paper is well motivated: there has been emerging interest in studying multimodal learning from an information-theoretic perspective, and this paper provides a well-suited, controlled testbed for such research.
- The proposed data generation pipeline is novel, well documented, and clearly explained.
- One major limitation of this work is the lack of empirical evaluation, either qualitative (e.g., Figure 2, for which the paper itself acknowledges that "there is no clear visual connection between these pairs of images") or quantitative. This makes it **very hard to verify the correctness** of the proposed framework. In particular, the reviewer does not agree with the claim that "our framework allows us to state unequivocally that these high-dimensional, complex datasets have
