Visual and audio scene classification for detecting discrepancies in video: a baseline method and experimental protocol
Konstantinos Apostolidis, Jakob Abesser, Luca Cuccovillo, Vasileios Mezaris

TL;DR
This paper introduces a baseline method and an experimental protocol for detecting audio-visual discrepancies in videos. The method applies a scene classifier to each modality to support content verification, and the protocol provides a common evaluation framework.
Contribution
It presents a novel audio-visual scene classifier and an experimental protocol with a benchmark dataset for detecting inconsistencies between audio and video content.
Findings
Achieved state-of-the-art scene classification accuracy
Demonstrated promising results in detecting audio-visual discrepancies
Provided a new benchmark dataset and evaluation protocol
Abstract
This paper presents a baseline approach and an experimental protocol for a specific content verification problem: detecting discrepancies between the audio and video modalities in multimedia content. We first design and optimize an audio-visual scene classifier and compare it with existing classification baselines that use both modalities. Then, by applying this classifier separately to the audio and the visual modality, we can detect scene-class inconsistencies between them. To facilitate further research and provide a common evaluation platform, we introduce an experimental protocol and a benchmark dataset simulating such inconsistencies. Our approach achieves state-of-the-art results in scene classification and promising outcomes in audio-visual discrepancy detection, highlighting its potential in content verification applications.
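The per-modality comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each modality's classifier has already produced a probability for every scene class, and the function names, the example class labels, and the `min_confidence` threshold are all hypothetical choices made for illustration.

```python
from typing import Dict, Tuple


def top_class(probs: Dict[str, float]) -> Tuple[str, float]:
    """Return the most probable scene label and its probability."""
    label = max(probs, key=probs.get)
    return label, probs[label]


def detect_discrepancy(
    visual_probs: Dict[str, float],
    audio_probs: Dict[str, float],
    min_confidence: float = 0.5,  # hypothetical threshold, not from the paper
) -> dict:
    """Flag a video when the two modalities confidently predict different scenes."""
    v_label, v_conf = top_class(visual_probs)
    a_label, a_conf = top_class(audio_probs)
    # Only trust a disagreement when both predictions are reasonably confident.
    confident = v_conf >= min_confidence and a_conf >= min_confidence
    return {
        "visual_class": v_label,
        "audio_class": a_label,
        "discrepancy": confident and v_label != a_label,
    }


# Illustrative scores: the visual stream looks like a park, the audio like a metro.
visual = {"airport": 0.1, "park": 0.8, "metro": 0.1}
audio = {"airport": 0.05, "park": 0.05, "metro": 0.9}
result = detect_discrepancy(visual, audio)
```

Requiring both predictions to be confident is one possible design choice to limit false alarms from uncertain classifications; a simpler variant would compare the top classes directly.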
Taxonomy
Topics: Digital Media Forensic Detection
