Multi-Modal Semantic Inconsistency Detection in Social Media News Posts
Scott McCrae, Kehan Wang, Avideh Zakhor

TL;DR
This paper presents a multi-modal framework for detecting semantic mismatches in social media news posts, combining text, audio, and video analysis to improve accuracy over uni-modal methods.
Contribution
The paper introduces a novel multi-modal fusion architecture and a new dataset for detecting semantic inconsistencies between videos and their captions in social media news posts.
Findings
Achieves 60.5% classification accuracy on random mismatches between caption and video appearance
Fusing multiple modalities improves accuracy over uni-modal methods
A new dataset of 4,000 Facebook news posts was curated
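One way to picture the "random mismatches" behind the accuracy figure is to pair some videos with captions taken from other posts. The sketch below is an illustration only: the function name, the 50% mismatch rate, and the (video_id, caption) post format are assumptions, not details taken from the paper.

```python
import random


def make_random_mismatches(posts, mismatch_fraction=0.5, seed=0):
    """Build (video_id, caption, label) triples; label 1 marks a swapped caption.

    `posts` is a list of (video_id, caption) pairs. The structure and the
    default mismatch rate are hypothetical choices for this sketch.
    """
    rng = random.Random(seed)
    examples = []
    for i, (video_id, caption) in enumerate(posts):
        if len(posts) > 1 and rng.random() < mismatch_fraction:
            # Draw an index from the other posts so a post never keeps its own caption.
            j = rng.randrange(len(posts) - 1)
            if j >= i:
                j += 1
            examples.append((video_id, posts[j][1], 1))
        else:
            examples.append((video_id, caption, 0))
    return examples
```

A classifier trained on such triples would see both original pairs (label 0) and swapped pairs (label 1).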
Abstract
As computer-generated content and deepfakes make steady improvements, semantic approaches to multimedia forensics will become more important. In this paper, we introduce a novel classification architecture for identifying semantic inconsistencies between video appearance and text caption in social media news posts. We develop a multi-modal fusion framework to identify mismatches between videos and captions in social media posts by leveraging an ensemble method based on textual analysis of the caption, automatic audio transcription, semantic video analysis, object detection, named entity consistency, and facial verification. To train and test our approach, we curate a new video-based dataset of 4,000 real-world Facebook news posts for analysis. Our multi-modal approach achieves 60.5% classification accuracy on random mismatches between caption and appearance, compared to accuracy below…
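The fusion idea in the abstract can be sketched as concatenating one feature vector per modality and scoring the result with a single classifier. This is a minimal sketch under stated assumptions, not the authors' architecture: the modality names paraphrase the abstract's list, while the feature dimensions, the logistic scorer, and all function names are hypothetical.

```python
import math

# Modality names paraphrased from the abstract; the fixed order is an assumption.
MODALITIES = [
    "caption_text",
    "audio_transcript",
    "video_semantics",
    "object_detection",
    "named_entities",
    "face_verification",
]


def fuse(features):
    """Concatenate per-modality feature vectors in a fixed order."""
    fused = []
    for name in MODALITIES:
        fused.extend(features[name])
    return fused


def mismatch_probability(fused, weights, bias=0.0):
    """Score a fused vector with a logistic model (a stand-in classifier)."""
    if len(fused) != len(weights):
        raise ValueError("weights must match the fused feature length")
    z = bias + sum(w * x for w, x in zip(weights, fused))
    return 1.0 / (1.0 + math.exp(-z))
```

In practice the weights would be learned from labeled matched and mismatched posts; a uni-modal baseline corresponds to keeping only one entry of `MODALITIES`.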
Taxonomy
Topics: Digital Media Forensic Detection · Multimodal Machine Learning Applications · Misinformation and Its Impacts
