TL;DR
This paper introduces a scalable diffusion-based method with dynamic modality gating and mutual learning to restore missing modalities in vision-language models, enhancing their robustness without fine-tuning.
Contribution
It proposes a novel diffusion-based restoration strategy that maintains pre-trained VLMs' integrity and improves zero-shot robustness to missing modalities.
Findings
Outperforms existing baselines in zero-shot evaluations
Maintains original VLM integrity without fine-tuning
Effective across diverse missing rates and conditions
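To make "missing rate" concrete, here is a minimal sketch of how an evaluation set could be partitioned at a given rate. It is an assumption for illustration only: this summary does not describe the paper's actual protocol. The function name `simulate_missing` and the labels `missing_image` / `missing_text` are hypothetical.

```python
import random


def simulate_missing(n_samples, missing_rate, seed=0):
    """Assign each sample a condition label for a missing-modality evaluation.

    Hypothetical protocol (not taken from the paper): a fixed fraction
    `missing_rate` of samples loses exactly one modality, chosen uniformly
    between image and text; the rest stay complete.
    """
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must be in [0, 1]")
    rng = random.Random(seed)  # local RNG so results are reproducible
    n_missing = round(n_samples * missing_rate)
    missing_idx = set(rng.sample(range(n_samples), n_missing))
    labels = []
    for i in range(n_samples):
        if i in missing_idx:
            labels.append(rng.choice(["missing_image", "missing_text"]))
        else:
            labels.append("complete")
    return labels


labels = simulate_missing(n_samples=100, missing_rate=0.7)
n_incomplete = sum(label != "complete" for label in labels)
```

With a split like this, a restoration module would only be invoked for samples whose label is not `complete`, and accuracy can be reported per missing rate.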
Abstract
Vision Language Models (VLMs) typically assume complete modality input during inference. However, their effectiveness drops sharply when certain modalities are unavailable or incomplete. Current research on missing modalities primarily faces two dilemmas: prompt-based methods struggle to restore missing yet indispensable features and degrade the generalizability of VLMs, while imputation-based approaches, lacking effective guidance, are prone to generating semantically irrelevant noise. Restoring precise semantics while sustaining the VLM's generalization remains challenging. Therefore, we propose a general missing modality restoration strategy in this paper. We introduce an enhanced diffusion model as a pluggable mid-stage training module to effectively restore missing features. Our strategy introduces two key innovations: (I) Dynamic Modality Gating, which adaptively leverages conditional features…
