Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification

Zizhao Chen; Ping Wei; Ziyang Ren; Huan Li; Xiangru Yin

arXiv:2603.26052 · cs.CV · March 30, 2026

TL;DR

MaLSF is a framework for multimodal media verification that actively detects subtle local semantic conflicts between image and text, using mask-label pairs as semantic anchors and bidirectional cross-modal verification. It reports state-of-the-art results on DGM4 and fake news detection tasks.

Contribution

Introduces MaLSF, a new active, bidirectional verification framework using mask-label pairs to improve detection of subtle multimodal misinformation.

Findings

01. Achieves state-of-the-art results on DGM4 and fake news detection tasks.

02. Effectively identifies local semantic conflicts that passive methods miss.

03. Provides interpretability through visualization and ablation studies.

Abstract

As multimodal misinformation becomes more sophisticated, its detection and grounding are crucial. However, current multimodal verification methods, relying on passive holistic fusion, struggle with sophisticated misinformation. Due to 'feature dilution,' global alignments tend to average out subtle local semantic inconsistencies, effectively masking the very conflicts they are designed to find. We introduce MaLSF (Mask-aware Local Semantic Fusion), a novel framework that shifts the paradigm to active, bidirectional verification, mimicking human cognitive cross-referencing. MaLSF utilizes mask-label pairs as semantic anchors to bridge pixels and words. Its core mechanism features two innovations: 1) a Bidirectional Cross-modal Verification (BCV) module that acts as an interrogator, using parallel query streams (Text-as-Query and Image-as-Query) to explicitly pinpoint conflicts; and 2) a…
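
To make the "feature dilution" argument and the two query streams concrete, here is a minimal sketch in plain Python. It is not the paper's BCV module: it assumes toy one-hot embeddings, generic dot-product attention, and a hypothetical conflict score of 1 minus the cosine similarity between a query and its attended context. All function and variable names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, scale):
    """For each query, return the attention-weighted average of the keys
    (keys double as values in this toy setup)."""
    dim = len(keys[0])
    contexts = []
    for q in queries:
        weights = softmax([scale * dot(q, k) for k in keys])
        contexts.append([sum(w * k[d] for w, k in zip(weights, keys))
                         for d in range(dim)])
    return contexts

def conflict_scores(queries, keys, scale=4.0):
    """Hypothetical local conflict score: 1 - cosine(query, attended context).
    It is high when nothing in the other modality supports the query."""
    contexts = attend(queries, keys, scale)
    return [1.0 - cosine(q, c) for q, c in zip(queries, contexts)]

def mean_pool(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def one_hot(index, dim=5):
    return [1.0 if d == index else 0.0 for d in range(dim)]

# Toy data (assumption): three text tokens match three image regions;
# the last token and the last region have no counterpart.
text_tokens = [one_hot(0), one_hot(1), one_hot(2), one_hot(3)]
image_regions = [one_hot(0), one_hot(1), one_hot(2), one_hot(4)]

# Passive holistic fusion: compare one pooled vector per modality.
global_sim = cosine(mean_pool(text_tokens), mean_pool(image_regions))

# Two parallel query streams, one per direction.
text_conflict = conflict_scores(text_tokens, image_regions)   # Text-as-Query
image_conflict = conflict_scores(image_regions, text_tokens)  # Image-as-Query

print("global similarity:", round(global_sim, 3))
print("text-as-query conflicts:", [round(c, 3) for c in text_conflict])
print("image-as-query conflicts:", [round(c, 3) for c in image_conflict])
```

In this toy setup the pooled similarity stays fairly high even though one token and one region are unsupported, while the per-query scores single out exactly those two elements. Querying in both directions matters here because the unsupported text token only shows up in the Text-as-Query stream and the unsupported image region only in the Image-as-Query stream.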
