Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models
Mingyeong Kim, Jungwon Choi, Chaeyun Jang, Juho Lee (Kim Jaechul Graduate School of AI, KAIST)

TL;DR
This paper introduces LIM, a lightweight module that improves vision-language model performance on text-only inputs by predicting imagined visual embeddings from the text, raising accuracy and reducing calibration error without image synthesis.
Contribution
The paper proposes the Latent Imagination Module (LIM), a novel cross-attention component that predicts visual embeddings from text to improve VLMs in missing-modality scenarios.
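To make the idea of predicting visual embeddings from text through cross-attention concrete, here is a minimal NumPy sketch. It is not the authors' implementation. The class name `ImaginationSketch`, the use of learned query vectors, the single attention head, and all dimensions are assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


class ImaginationSketch:
    """Hypothetical single-head cross-attention block (illustrative only).

    A fixed set of query vectors attends over text token embeddings and
    returns one "imagined" latent embedding per query.
    """

    def __init__(self, text_dim, latent_dim, num_latents, seed=0):
        rng = np.random.default_rng(seed)
        self.latent_dim = latent_dim
        # Assumption: one query per imagined latent token.
        self.queries = rng.standard_normal((num_latents, latent_dim))
        # Projections from the text space into the latent space.
        self.w_key = rng.standard_normal((text_dim, latent_dim)) / np.sqrt(text_dim)
        self.w_value = rng.standard_normal((text_dim, latent_dim)) / np.sqrt(text_dim)

    def attention(self, text_embeddings):
        """Attention weights of shape (num_latents, num_text_tokens)."""
        keys = text_embeddings @ self.w_key
        scores = self.queries @ keys.T / np.sqrt(self.latent_dim)
        return softmax(scores)

    def __call__(self, text_embeddings):
        """Imagined latent embeddings of shape (num_latents, latent_dim)."""
        values = text_embeddings @ self.w_value
        return self.attention(text_embeddings) @ values


# Toy input: 12 text tokens with 32-dimensional embeddings.
rng = np.random.default_rng(1)
text_embeddings = rng.standard_normal((12, 32))

module = ImaginationSketch(text_dim=32, latent_dim=64, num_latents=4)
imagined = module(text_embeddings)
print(imagined.shape)  # (4, 64)
```

In a setup like the one the paper describes, embeddings of this kind would stand in for the missing visual tokens at the input of a frozen VLM backbone, with only the small module being trained. How the paper trains and wires the module is not specified here.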
Findings
LIM improves accuracy on text-only benchmarks.
LIM reduces calibration errors in missing-modality settings.
LIM's gains in accuracy and calibration carry over to unseen tasks.
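The calibration findings above refer to calibration error. A common way to measure it is expected calibration error (ECE), sketched below in plain Python. Whether the paper uses exactly this metric, this binning scheme, or this number of bins is an assumption; the function name and the toy data are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width confidence bins.

    Each bin contributes |accuracy - mean confidence|, weighted by the
    fraction of predictions that fall into it.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("at least one prediction is required")

    # Bin k holds confidences in [k / n_bins, (k + 1) / n_bins);
    # a confidence of exactly 1.0 goes into the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, bool(is_correct)))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Toy example: an overconfident model that claims 0.9 but is right 3 times out of 4.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
print(round(toy_ece, 2))  # 0.15
```

A lower value means that stated confidence tracks observed accuracy more closely, which is the sense in which a reduction in calibration error indicates a more reliable model.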
Abstract
Vision-language models (VLMs) are often deployed on text-only inputs, although they are trained with images. We find that removing the vision modality causes large drops in accuracy and severe miscalibration, and the model does not behave like its original language backbone under text-only prompting. This failure is not explained only by missing semantic information. Even when text descriptions preserve key content, confidence becomes unreliable, while adding a visual signal through generated images partially restores accuracy and calibration. We propose the Latent Imagination Module (LIM), a lightweight cross-attention module that predicts imagined latent embeddings from textual input and feeds them into a frozen VLM backbone without pixel-level image synthesis. Across text-only benchmarks, unseen tasks, and missing-image scenarios, LIM improves accuracy and reduces calibration error.…
