URMF: Uncertainty-aware Robust Multimodal Fusion for Multimodal Sarcasm Detection

Zhenyu Wang; Weichen Cheng; Weijia Li; Junjie Mou; Zongyou Zhao; Guoying Zhang

arXiv:2604.06728 · cs.CV · May 5, 2026

TL;DR

URMF is an uncertainty-aware fusion framework for multimodal sarcasm detection. It models the uncertainty of each modality and weights the modalities dynamically during fusion, which the authors report improves both accuracy and robustness.

Contribution

The paper proposes a novel uncertainty-aware fusion method that models modality-specific uncertainty to enhance robustness in multimodal sarcasm detection.

Findings

01. URMF outperforms existing baselines on the MSD and MMSD2 benchmarks.

02. Explicit uncertainty modeling improves both accuracy and robustness.

03. Dynamic modality weighting reduces the impact of noisy or unreliable evidence.
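
To make the idea of dynamic modality weighting concrete, here is a minimal sketch of one common scheme, inverse-variance (precision) weighting. This summary does not say how URMF computes its weights, so the scheme itself, the function names `precision_weights` and `fuse`, and the use of one scalar log-variance per modality are all assumptions made for illustration.

```python
import math


def precision_weights(log_variances):
    """Return weights proportional to 1 / variance, normalized to sum to 1.

    Assumption: each modality reports one scalar log-variance as its
    uncertainty. A softmax over the negative log-variances is algebraically
    the same as normalizing the precisions 1 / variance.
    """
    neg = [-lv for lv in log_variances]
    m = max(neg)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in neg]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(features, log_variances):
    """Weighted sum of equally sized feature vectors, one per modality."""
    weights = precision_weights(log_variances)
    dim = len(features[0])
    fused = [
        sum(w * f[i] for w, f in zip(weights, features))
        for i in range(dim)
    ]
    return fused, weights


# Hypothetical example: the second modality is three times noisier.
text_feat = [1.0, 0.0]
image_feat = [0.0, 1.0]
fused, weights = fuse([text_feat, image_feat], [0.0, math.log(3.0)])
```

With these inputs the weights come out as roughly 0.75 for the first modality and 0.25 for the second, so the noisier modality contributes less to the fused vector.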

Abstract

Multimodal sarcasm detection (MSD) aims to identify sarcastic intent from semantic incongruity between text and image. Although recent methods have improved MSD through cross-modal interaction and incongruity reasoning, most still treat modalities as equally reliable. In real social media posts, however, text and images often differ in noise level and relevance, making deterministic fusion susceptible to noisy evidence and weakened incongruity cues. To address this issue, we propose Uncertainty-aware Robust Multimodal Fusion (URMF), a unified framework for robust MSD. URMF first injects visual evidence into textual representations through multi-head cross-attention, and then applies self-attention in the fused semantic space to enhance incongruity reasoning. It models textual, visual, and interaction-aware representations as learnable Gaussian posteriors to estimate modality-specific…
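
The abstract describes representations modeled as learnable Gaussian posteriors. The sketch below shows what a diagonal Gaussian posterior typically involves: a mean and a log-variance per dimension, a reparameterized sample, and a KL divergence to a standard normal prior. The function names and the choice of a standard normal prior are assumptions for illustration; the truncated abstract does not state how URMF parameterizes or regularizes its posteriors.

```python
import math
import random


def sample_gaussian(mu, log_var, rng):
    """Reparameterized draw: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions.

    Assumption: a standard normal prior, as is common in variational models.
    """
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )


# Hypothetical two-dimensional posterior for one modality.
rng = random.Random(0)
mu = [0.5, -0.2]
log_var = [0.0, -1.0]
z = sample_gaussian(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

A larger predicted variance means a more spread-out sample and can be read as lower confidence in that representation; the KL term is zero only when the posterior equals the prior.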
