Buffer replay enhances the robustness of multimodal learning under missing-modality

Hongye Zhu; Xuan Liu; Yanwen Ba; Jingye Xue; Shigeng Zhang

arXiv:2511.23070 · cs.CV · December 1, 2025

TL;DR

This paper introduces REplay Prompting (REP), a method that makes multimodal models more robust to missing modalities by caching early-layer features in modality-wise buffers, replaying them in deeper layers, and configuring the buffers with a task-aware dynamic initialization. The authors report that it outperforms prior approaches.

Contribution

REP is a lightweight framework that constructs modality-specific feature buffers and employs a dynamic initialization mechanism to improve robustness in missing-modality scenarios.

Findings

01  REP outperforms prior methods on various benchmarks.

02  REP maintains high performance with minimal parameter overhead.

03  REP is effective across vision-language, audio, and temporal multimodal tasks.

Abstract

Missing modalities consistently lead to significant performance degradation in multimodal models. Existing approaches either synthesize missing modalities at high computational cost or apply prompt-based fine-tuning that relies only on adjacent-layer features and overlooks long-distance contextual information, which may offer additional tolerance to errors when one or more modalities are missing. To address this, we introduce REplay Prompting (REP): (1) construct modality-wise feature buffers via a residual bypass to cache early-layer representations and replay them in deeper layers, mitigating information loss as network depth increases; (2) employ a private-shared feature decoupling strategy, where private buffers preserve modality-specific signals and shared buffers encode cross-modal semantics; and (3) design a task-aware dynamic initialization mechanism to configure these buffers…
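
The buffer-and-replay idea in points (1) and (2) can be illustrated with a toy sketch. This is not the authors' implementation: the layer function, the choice of which layer to cache, the replay weight, and the use of a simple mean as the shared buffer are all assumptions made for illustration, and every name below (`toy_layer`, `forward_with_replay`, `replay_weight`, and so on) is hypothetical. The task-aware dynamic initialization in point (3) is not modeled because the abstract is truncated at that point.

```python
from typing import Dict, List, Optional

Vector = List[float]


def toy_layer(x: Vector, scale: float = 0.5) -> Vector:
    """Stand-in for an encoder block (assumption): element-wise scaling."""
    return [scale * v for v in x]


def add_scaled(a: Vector, b: Vector, weight: float) -> Vector:
    """Return a + weight * b, element-wise."""
    return [u + weight * v for u, v in zip(a, b)]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def forward_with_replay(
    inputs: Dict[str, Optional[Vector]],
    num_layers: int = 4,
    cache_layer: int = 0,      # hypothetical: which early layer to cache
    replay_from: int = 2,      # hypothetical: first layer that receives replay
    replay_weight: float = 0.5,
) -> Dict[str, Vector]:
    """Run each available modality through the toy layers with buffer replay.

    A missing modality is passed as None and is simply skipped here.
    """
    hidden = {m: x for m, x in inputs.items() if x is not None}
    if not hidden:
        raise ValueError("at least one modality must be present")

    private_buffer: Dict[str, Vector] = {}
    shared_buffer: Vector = []

    for layer in range(num_layers):
        hidden = {m: toy_layer(h) for m, h in hidden.items()}

        if layer == cache_layer:
            # Private buffers keep each modality's own early features.
            private_buffer = {m: list(h) for m, h in hidden.items()}
            # Shared buffer (assumption): mean over the available modalities.
            shared_buffer = mean(list(hidden.values()))

        if layer >= replay_from:
            # Residual bypass: add the cached early features back in.
            hidden = {
                m: add_scaled(
                    add_scaled(h, private_buffer[m], replay_weight),
                    shared_buffer,
                    replay_weight,
                )
                for m, h in hidden.items()
            }
    return hidden


if __name__ == "__main__":
    sample = {"image": [4.0, 8.0], "text": None}  # text modality is missing
    print(forward_with_replay(sample))
    print(forward_with_replay(sample, replay_from=4))  # replay disabled
```

In this toy setting the repeated scaling shrinks the signal with depth, and replaying the cached early-layer features keeps more of it in the final representation. That mirrors, in a very simplified way, the motivation stated in the abstract of mitigating information loss as network depth increases.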

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Speech and Audio Processing