HAMLET-FFD: Hierarchical Adaptive Multi-modal Learning Embeddings Transformation for Face Forgery Detection

Jialei Cui; Jianwei Du; Yanzhe Li; Lei Gao; Hui Jiang; Chenfu Bao

arXiv:2507.20913·cs.CV·July 29, 2025

TL;DR

HAMLET-FFD is a cognitively inspired, hierarchical multi-modal framework for face forgery detection. It integrates visual and textual cues through a bidirectional fusion mechanism to improve cross-domain generalization.

Contribution

The paper introduces a hierarchical multi-modal learning approach that combines a knowledge refinement loop with bidirectional fusion, and it builds on CLIP without fine-tuning it.

Findings

01 Outperforms existing methods on unseen manipulations
02 Enhances cross-domain generalization in face forgery detection
03 Reveals specialized embeddings for artifact recognition

Abstract

The rapid evolution of face manipulation techniques poses a critical challenge for face forgery detection: cross-domain generalization. Conventional methods, which rely on simple classification objectives, often fail to learn domain-invariant representations. We propose HAMLET-FFD, a cognitively inspired Hierarchical Adaptive Multi-modal Learning framework that tackles this challenge via bidirectional cross-modal reasoning. Building on contrastive vision-language models such as CLIP, HAMLET-FFD introduces a knowledge refinement loop that iteratively assesses authenticity by integrating visual evidence with conceptual cues, emulating expert forensic analysis. A key innovation is a bidirectional fusion mechanism in which textual authenticity embeddings guide the aggregation of hierarchical visual features, while modulated visual features refine text embeddings to generate image-adaptive…
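
The bidirectional fusion idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`text_guided_aggregate`, `refine_text`, `refinement_loop`), the softmax-over-cosine weighting, the residual update with `alpha`, and the `temperature` value are all assumptions chosen only to make the two directions of information flow concrete. Plain Python lists stand in for CLIP embeddings.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # Scale to unit length; a zero vector is returned unchanged.
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n > 0 else list(v)

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def text_guided_aggregate(layer_feats, text_emb, temperature=0.1):
    # Text -> vision (assumed form): weight each layer's visual feature
    # by its cosine similarity to the text embedding, then sum.
    t = normalize(text_emb)
    scores = [dot(normalize(f), t) / temperature for f in layer_feats]
    weights = softmax(scores)
    fused = [sum(w * f[i] for w, f in zip(weights, layer_feats))
             for i in range(len(t))]
    return fused, weights

def refine_text(text_emb, fused_visual, alpha=0.5):
    # Vision -> text (assumed form): residual shift of the text embedding
    # toward the fused visual feature, renormalized to unit length.
    t = normalize(text_emb)
    v = normalize(fused_visual)
    return normalize([ti + alpha * vi for ti, vi in zip(t, v)])

def refinement_loop(layer_feats, text_emb, steps=2):
    # Hypothetical refinement loop: alternate the two directions,
    # then aggregate once more with the refined text embedding.
    for _ in range(steps):
        fused, _ = text_guided_aggregate(layer_feats, text_emb)
        text_emb = refine_text(text_emb, fused)
    fused, weights = text_guided_aggregate(layer_feats, text_emb)
    return fused, text_emb, weights

# Toy inputs: three "layers" of 3-dimensional visual features and one
# text embedding. Real CLIP embeddings would be far higher-dimensional.
layer_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
text_emb = [0.0, 1.0, 0.0]
fused, refined_text, weights = refinement_loop(layer_feats, text_emb, steps=2)
```

In this sketch the layer whose feature aligns best with the text embedding receives the largest aggregation weight, and the text embedding drifts toward the image content at each step, which is one simple way to obtain an image-adaptive text representation.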
