How Far Are We from Generating Missing Modalities with Foundation Models?
Guanzhou Ke, Bo Wang, Guoqing Chao, Weiming Hu, Shengfeng He

TL;DR
This paper evaluates the current capabilities of foundation models in reconstructing missing modalities, identifies key limitations, and proposes an agentic framework with self-refinement to improve reconstruction quality, demonstrating significant performance gains.
Contribution
The paper introduces an agentic framework with self-refinement for missing modality reconstruction, targeting the semantic extraction and validation weaknesses it identifies in foundation models.
Findings
Reduces FID by at least 14% for image reconstruction
Reduces MER by at least 10% for text reconstruction
Provides a comprehensive evaluation of 42 model variants
Abstract
Multimodal foundation models have demonstrated impressive capabilities across diverse tasks. However, their potential as plug-and-play solutions for missing modality reconstruction remains underexplored. To bridge this gap, we identify and formalize three potential paradigms for missing modality reconstruction, and perform a comprehensive evaluation across these paradigms, covering 42 model variants in terms of reconstruction accuracy and adaptability to downstream tasks. Our analysis reveals that current foundation models often fall short in two critical aspects: (i) fine-grained semantic extraction from the available modalities, and (ii) robust validation of generated modalities. These limitations lead to suboptimal and, at times, misaligned generations. To address these challenges, we propose an agentic framework tailored for missing modality reconstruction. This framework…
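The two weaknesses named in the abstract (semantic extraction and validation of generated modalities) suggest a generate, validate, refine loop. The following is a minimal sketch of such a loop under stated assumptions, not the authors' implementation: every name here (`self_refine`, `Candidate`, `toy_extract`, `toy_generate`, `toy_validate`) is hypothetical, and the toy functions work on plain strings as stand-ins for real foundation-model calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    content: str   # generated stand-in for the missing modality
    score: float   # validator's alignment score in [0, 1]


def self_refine(
    available: str,
    extract: Callable[[str], List[str]],
    generate: Callable[[List[str], str], str],
    validate: Callable[[str, List[str]], Tuple[float, str]],
    threshold: float = 0.9,
    max_rounds: int = 3,
) -> Candidate:
    """Generate, validate, and regenerate with feedback until accepted.

    Hypothetical loop: the paper's actual agent design is not given in
    the abstract, so the stopping rule and feedback format are assumptions.
    """
    semantics = extract(available)          # step 1: semantic extraction
    feedback = ""
    best = Candidate(content="", score=-1.0)
    for _ in range(max_rounds):
        content = generate(semantics, feedback)          # step 2: generation
        score, feedback = validate(content, semantics)   # step 3: validation
        if score > best.score:
            best = Candidate(content, score)
        if score >= threshold:              # accept once the validator is satisfied
            break
    return best


# Toy stand-ins (assumptions): keywords play the role of extracted semantics.
def toy_extract(caption: str) -> List[str]:
    return [w.strip(".,").lower() for w in caption.split() if len(w) > 3]


def toy_generate(semantics: List[str], feedback: str) -> str:
    # The first attempt drops the last keyword; later attempts add back
    # whatever the validator reported as missing.
    kept = semantics[:-1]
    missing = [w for w in feedback.split() if w in semantics]
    return " ".join(kept + missing)


def toy_validate(content: str, semantics: List[str]) -> Tuple[float, str]:
    if not semantics:
        return 1.0, ""
    present = set(content.split())
    missing = [w for w in semantics if w not in present]
    return 1 - len(missing) / len(semantics), " ".join(missing)


result = self_refine(
    "A brown dog chasing a frisbee.", toy_extract, toy_generate, toy_validate
)
```

In this toy run the first attempt omits one keyword, the validator reports it as feedback, and the second attempt is accepted. In a real system the three callables would wrap model calls, and the validator's score would come from a learned or prompted judge, not from keyword overlap.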
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
