Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification
Zizhao Chen, Ping Wei, Ziyang Ren, Huan Li, Xiangru Yin

TL;DR
MaLSF is a framework for multimodal media verification that actively detects subtle semantic conflicts between image and text by using mask-label pairs as anchors and bidirectional cross-modal verification. It reports state-of-the-art results on DGM4 and on fake news detection tasks.
Contribution
Introduces MaLSF, a new active, bidirectional verification framework using mask-label pairs to improve detection of subtle multimodal misinformation.
Findings
Achieves state-of-the-art results on DGM4 and fake news detection tasks.
Effectively identifies local semantic conflicts that passive methods miss.
Provides interpretability through visualization and ablation studies.
Abstract
As multimodal misinformation becomes more sophisticated, its detection and grounding are crucial. However, current multimodal verification methods, relying on passive holistic fusion, struggle with sophisticated misinformation. Due to 'feature dilution,' global alignments tend to average out subtle local semantic inconsistencies, effectively masking the very conflicts they are designed to find. We introduce MaLSF (Mask-aware Local Semantic Fusion), a novel framework that shifts the paradigm to active, bidirectional verification, mimicking human cognitive cross-referencing. MaLSF utilizes mask-label pairs as semantic anchors to bridge pixels and words. Its core mechanism features two innovations: 1) a Bidirectional Cross-modal Verification (BCV) module that acts as an interrogator, using parallel query streams (Text-as-Query and Image-as-Query) to explicitly pinpoint conflicts; and 2) a…
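The 'feature dilution' argument and the two query streams can be illustrated with a toy example. The sketch below is not the authors' implementation: the one-hot features, the function names (`cross_attend`, `agreement`, `mean_pool`), and the cosine-based agreement score are all assumptions made for illustration, and it covers only the Text-as-Query and Image-as-Query idea, since the abstract above is truncated before the second module is described. Seven of eight text tokens match their image regions and one conflicts; the pooled global similarity stays high, while each query stream singles out the conflicting element.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mean_pool(vectors):
    # Holistic fusion stand-in: average all local features into one vector.
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cross_attend(queries, keys):
    # Scaled dot-product attention; the keys double as values (assumption).
    dim = len(keys[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        attended.append(
            [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
        )
    return attended

def agreement(queries, keys):
    # Hypothetical score: how well each query agrees with what it retrieves
    # from the other modality. Low values flag a possible local conflict.
    return [cosine(q, a) for q, a in zip(queries, cross_attend(queries, keys))]

# Toy features (assumed): 8 text tokens and 8 image regions as one-hot vectors.
n = 8
text = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
image = [row[:] for row in text]
image[7][7] = -1.0  # region 7 contradicts token 7

# Passive, holistic view: one pooled vector per modality.
global_score = cosine(mean_pool(text), mean_pool(image))

# Active, bidirectional view: two parallel query streams.
text_scores = agreement(text, image)   # Text-as-Query
image_scores = agreement(image, text)  # Image-as-Query

print("global similarity:", round(global_score, 3))
print("least-supported token:", min(range(n), key=text_scores.__getitem__))
print("least-supported region:", min(range(n), key=image_scores.__getitem__))
```

In this toy setting the single contradiction barely moves the pooled score, whereas the per-query agreement turns negative only for the conflicting token and region. A real system would use learned encoders and projections rather than one-hot vectors.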
