Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding
Jiazhen Wang, Bin Liu, Changtao Miao, Zhiwei Zhao, Wanyi Zhuang, Qi Chu, Nenghai Yu

TL;DR
This paper introduces a transformer-based framework that leverages modality-specific features for improved multi-modal manipulation detection and grounding, outperforming existing methods by preserving modality uniqueness and enhancing forged detail discovery.
Contribution
The proposed model uniquely combines visual/language pre-trained encoders, dual-branch cross-attention, decoupled classifiers, and an implicit manipulation query to enhance detection and grounding of multi-modal manipulations.
Findings
Outperforms state-of-the-art methods on the DGM4 dataset
Effectively exploits modality-specific features
Improves forged detail detection
Abstract
AI-synthesized text and images have gained significant attention, particularly due to the widespread dissemination of multi-modal manipulations on the internet, which has resulted in numerous negative impacts on society. Existing methods for multi-modal manipulation detection and grounding primarily focus on fusing vision-language features to make predictions, while overlooking the importance of modality-specific features, leading to sub-optimal results. In this paper, we construct a simple and novel transformer-based framework for multi-modal manipulation detection and grounding tasks. Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment. To achieve this, we introduce visual/language pre-trained encoders and dual-branch cross-attention (DCA) to extract and fuse modality-unique features. Furthermore, we design…
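To make the dual-branch idea in the abstract concrete, here is a minimal sketch of two cross-attention branches, one per modality. It is an illustration under stated assumptions, not the authors' implementation: the function names (`cross_attention`, `dual_branch_cross_attention`), the absence of learned projection matrices, and the residual addition are all hypothetical simplifications chosen so the example runs with the standard library alone.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Scaled dot-product attention where queries come from one modality
    and keys/values come from the other. Assumption: no learned
    projections, so the raw token vectors act as Q, K and V."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append([
            sum(w * v[j] for w, v in zip(weights, keys_values))
            for j in range(dim)
        ])
    return attended


def dual_branch_cross_attention(image_tokens, text_tokens):
    """Two branches: image tokens attend to text, text tokens attend to image.
    Assumption: a residual sum keeps each modality's own features alongside
    the information gathered from the other modality."""
    image_ctx = cross_attention(image_tokens, text_tokens)
    text_ctx = cross_attention(text_tokens, image_tokens)
    image_out = [[a + b for a, b in zip(tok, ctx)]
                 for tok, ctx in zip(image_tokens, image_ctx)]
    text_out = [[a + b for a, b in zip(tok, ctx)]
                for tok, ctx in zip(text_tokens, text_ctx)]
    return image_out, text_out


# Toy input: three 2-d image tokens and one 2-d text token.
image = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[0.5, -0.5]]
image_out, text_out = dual_branch_cross_attention(image, text)
```

Each branch returns as many tokens as its own modality supplied, so the image and text streams stay separate after the exchange. That separation is what would let per-modality heads operate on their own stream, although how the paper's decoupled classifiers consume these features is not specified in the text above.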
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: Focus
