Memory-Anchored Multimodal Reasoning for Explainable Video Forensics

Chen Chen; Runze Li; Zejun Zhang; Pukun Zhao; Fanqing Zhou; Longxiang Wang; Haojian Huang

arXiv:2508.14581·cs.MM·September 11, 2025

TL;DR

FakeHunter is a multimodal framework for detecting manipulated video. It combines memory-guided retrieval, an Observation-Thought-Action reasoning loop, and adaptive forensic tool use to make detection more robust and explainable.

Contribution

The paper introduces FakeHunter, which integrates memory retrieval, structured reasoning, and adaptive forensic analysis, and releases a new benchmark, X-AVFake, for comprehensive evaluation of multimodal fake detection methods.

Findings

01. FakeHunter outperforms existing multimodal baselines.

02. Memory-guided retrieval enhances contextual understanding.

03. Selective analysis improves detection robustness and explanation quality.

Abstract

We address multimodal deepfake detection requiring both robustness and interpretability by proposing FakeHunter, a unified framework that combines memory-guided retrieval, a structured Observation-Thought-Action reasoning loop, and adaptive forensic tool invocation. Visual representations from a Contrastive Language-Image Pretraining (CLIP) model and audio representations from a Contrastive Language-Audio Pretraining (CLAP) model retrieve semantically aligned authentic exemplars from a large-scale memory, providing contextual anchors that guide iterative localization and explanation of suspected manipulations. Under low internal confidence, the framework selectively triggers fine-grained analyses such as spatial region zoom and mel-spectrogram inspection to gather discriminative evidence instead of relying on opaque marginal scores. We also release X-AVFake, a comprehensive audio visual…
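The two mechanisms named in the abstract, retrieving authentic exemplars by embedding similarity and invoking extra analyses only while confidence is low, can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not state the similarity metric, the number of retrieved exemplars, the confidence threshold, or how a tool result updates confidence, so cosine similarity, `k`, `threshold`, and the function names `retrieve_anchors` and `run_loop` are all assumptions. Toy three-dimensional vectors stand in for CLIP and CLAP embeddings.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_anchors(
    query: Vector, memory: Dict[str, Vector], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the k memory entries most similar to the query, best first.

    Assumption: cosine similarity over a flat in-memory dict; a large-scale
    memory would normally sit behind an approximate nearest-neighbour index.
    """
    scored = [(name, cosine(query, vec)) for name, vec in memory.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


def run_loop(
    confidence: float,
    tools: List[Callable[[], float]],
    threshold: float = 0.8,
) -> Tuple[float, int]:
    """Invoke tools one at a time while confidence stays below the threshold.

    Each hypothetical tool (e.g. a region zoom or a mel-spectrogram check)
    returns an updated confidence. Returns (final confidence, tools used).
    """
    used = 0
    for tool in tools:
        if confidence >= threshold:
            break  # confident enough: skip the remaining analyses
        confidence = tool()
        used += 1
    return confidence, used


# Toy memory of "authentic" exemplars (hypothetical names and vectors).
memory = {
    "studio_interview": [1.0, 0.0, 0.0],
    "street_scene": [0.0, 1.0, 0.0],
    "news_desk": [0.9, 0.1, 0.0],
}
anchors = retrieve_anchors([1.0, 0.05, 0.0], memory, k=2)
final_confidence, tools_used = run_loop(0.5, [lambda: 0.6, lambda: 0.9, lambda: 0.95])
```

In this toy run the loop stops after the second tool, because the confidence it returns clears the assumed threshold and the third analysis is never invoked. That is the selective behaviour the abstract describes.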

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications