Buffer replay enhances the robustness of multimodal learning under missing-modality
Hongye Zhu, Xuan Liu, Yanwen Ba, Jingye Xue, Shigeng Zhang

TL;DR
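This paper introduces REplay Prompting (REP), a lightweight method that makes multimodal models more robust to missing modalities by caching early-layer features in modality-specific buffers, replaying them in deeper layers, and configuring the buffers through task-aware dynamic initialization. The authors report that it outperforms prior approaches.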
Contribution
REP is a lightweight framework that constructs modality-specific feature buffers and employs a dynamic initialization mechanism to improve robustness in missing-modality scenarios.
Findings
REP outperforms prior methods on multiple benchmarks in missing-modality settings.
REP maintains high performance with minimal parameter overhead.
REP is effective across vision-language, audio, and temporal multimodal tasks.
Abstract
Missing modalities consistently lead to significant performance degradation in multimodal models. Existing approaches either synthesize missing modalities at high computational cost or apply prompt-based fine-tuning that relies only on adjacent-layer features and overlooks long-distance contextual information, which may offer additional tolerance to errors when one or more modalities are missing. To address this, we introduce REplay Prompting (REP): (1) construct modality-wise feature buffers via a residual bypass to cache early-layer representations and replay them in deeper layers, mitigating information loss as network depth increases; (2) employ a private-shared feature decoupling strategy, where private buffers preserve modality-specific signals and shared buffers encode cross-modal semantics; and (3) design a task-aware dynamic initialization mechanism to configure these buffers…
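The buffer-and-replay idea in steps (1) and (2) can be illustrated with a toy sketch. This is not the authors' implementation: the names `ReplayBuffer`, `toy_layer`, and `forward`, the halving layer, the mean-pooled shared buffer, and the fixed scalar `gate` are all assumptions chosen to keep the example small, and the task-aware initialization of step (3) is not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

Vector = List[float]


def toy_layer(x: Vector) -> Vector:
    # Stand-in for one encoder block (assumption): halves each activation,
    # so early-layer signal fades as depth increases.
    return [0.5 * v for v in x]


@dataclass
class ReplayBuffer:
    """Hypothetical cache of early-layer features."""
    private: Dict[str, Vector] = field(default_factory=dict)  # per-modality signal
    shared: Vector = field(default_factory=list)              # cross-modal signal

    def write(self, feats: Dict[str, Vector]) -> None:
        if not feats:
            raise ValueError("at least one modality must be present")
        self.private = {m: list(v) for m, v in feats.items()}
        dim = len(next(iter(feats.values())))
        # Shared buffer (assumption): element-wise mean over present modalities.
        self.shared = [sum(v[i] for v in feats.values()) / len(feats)
                       for i in range(dim)]

    def read(self, modality: str) -> Vector:
        # A modality with no private entry falls back to the shared buffer alone.
        priv = self.private.get(modality, [0.0] * len(self.shared))
        return [p + s for p, s in zip(priv, self.shared)]


def forward(inputs: Dict[str, Optional[Vector]], depth: int = 4,
            cache_at: int = 1, replay_from: int = 3,
            gate: float = 0.5) -> Dict[str, Vector]:
    """Run the toy stack; a missing modality is passed as None and dropped."""
    if not 1 <= cache_at < replay_from:
        raise ValueError("cache_at must come before replay_from")
    feats = {m: v for m, v in inputs.items() if v is not None}
    buffer = ReplayBuffer()
    for layer in range(1, depth + 1):
        feats = {m: toy_layer(v) for m, v in feats.items()}
        if layer == cache_at:
            buffer.write(feats)  # cache early-layer representations
        if layer >= replay_from:
            # Residual bypass: add the gated, cached features back in.
            feats = {m: [x + gate * r for x, r in zip(v, buffer.read(m))]
                     for m, v in feats.items()}
    return feats
```

Calling `forward({"image": [1.0, 2.0], "text": None})` runs the stack with the text modality missing. Setting `gate=0.0` disables replay, which makes it easy to compare deep-layer activations with and without the cached features.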
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Speech and Audio Processing
