DCER: Dual-Stage Compression and Energy-Based Reconstruction
Yiwen Wang, Jiahao Qin

TL;DR
DCER combines dual-stage compression with energy-based reconstruction to make multimodal fusion robust to noisy inputs and missing modalities, and reports state-of-the-art results.
Contribution
The paper presents a novel unified framework combining frequency-based compression and energy-based reconstruction to improve multimodal robustness against noise and missing data.
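To make the idea of frequency-based compression concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`dct2`, `idct2`, `dct_compress`) and the rule of keeping only the lowest-frequency coefficients are assumptions chosen for illustration, and the summary does not say how DCER selects or filters coefficients.

```python
import math

def dct2(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct2(c):
    """Inverse of dct2 (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            math.sqrt(2.0 / n) * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out

def dct_compress(x, keep):
    """Hypothetical compression rule: keep the `keep` lowest-frequency
    coefficients, zero the rest, and transform back."""
    coeffs = dct2(x)
    coeffs = [v if k < keep else 0.0 for k, v in enumerate(coeffs)]
    return idct2(coeffs)

# A slow component plus a fast component standing in for noise.
n = 16
slow = [math.cos(math.pi * (i + 0.5) * 1 / n) for i in range(n)]
fast = [0.3 * math.cos(math.pi * (i + 0.5) * 15 / n) for i in range(n)]
noisy = [s + f for s, f in zip(slow, fast)]
denoised = dct_compress(noisy, keep=4)
```

In this toy case the fast component lies entirely in a discarded coefficient, so `denoised` matches `slow` up to floating-point error. Real features would not separate this cleanly.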
Findings
State-of-the-art performance on CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets.
Energy-based uncertainty correlates strongly with prediction errors.
Multimodal fusion remains favored even when a high proportion of modalities is missing.
Abstract
Multimodal fusion faces two robustness challenges: noisy inputs degrade representation quality, and missing modalities cause prediction failures. We propose DCER, a unified framework addressing both challenges through dual-stage compression and energy-based reconstruction. The compression stage operates at two levels: within-modality frequency transforms (wavelet for audio, DCT for video) remove noise while preserving task-relevant patterns, and cross-modality bottleneck tokens force genuine integration rather than modality-specific shortcuts. For missing modalities, energy-based reconstruction recovers representations via gradient descent on a learned energy function, with the final energy providing intrinsic uncertainty quantification (ρ > 0.72 correlation with prediction error). Experiments on CMU-MOSI, CMU-MOSEI, and CH-SIMS demonstrate state-of-the-art performance…
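The reconstruction step described in the abstract (gradient descent on an energy function, with the final energy read as an uncertainty signal) can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: DCER learns its energy function, whereas the sketch substitutes a fixed quadratic energy, and the names `energy`, `energy_grad`, `reconstruct`, the vector `cue` (a stand-in for whatever the observed modalities predict about the missing one), and the parameters `lam`, `step`, and `n_steps` are all hypothetical.

```python
def energy(z, cue, lam):
    """Stand-in energy (assumption): squared distance to the cue derived
    from observed modalities, plus a small penalty on the norm of z."""
    fit = sum((zi - ci) ** 2 for zi, ci in zip(z, cue))
    prior = sum(zi ** 2 for zi in z)
    return 0.5 * fit + 0.5 * lam * prior

def energy_grad(z, cue, lam):
    """Gradient of the stand-in energy with respect to z."""
    return [(zi - ci) + lam * zi for zi, ci in zip(z, cue)]

def reconstruct(cue, lam=0.1, step=0.5, n_steps=50):
    """Recover a missing representation by gradient descent on the energy.

    Returns the reconstruction and its final energy; the final energy is
    the quantity the abstract describes as an uncertainty signal.
    """
    z = [0.0] * len(cue)  # start from an uninformative guess
    for _ in range(n_steps):
        g = energy_grad(z, cue, lam)
        z = [zi - step * gi for zi, gi in zip(z, g)]
    return z, energy(z, cue, lam)

cue = [1.1, -2.2]
z_hat, final_energy = reconstruct(cue)
```

For this quadratic energy the minimizer is `cue` scaled by `1 / (1 + lam)`, so the loop converges to approximately `[1.0, -2.0]`. With a learned, non-convex energy the step size and number of steps would need tuning, and a higher final energy would flag a less trustworthy reconstruction.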
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
