Veritas: Generalizable Deepfake Detection via Pattern-Aware Reasoning

Hao Tan; Jun Lan; Zichang Tan; Ajian Liu; Chuanbiao Song; Senyuan Shi; Huijia Zhu; Weiqiang Wang; Jun Wan; Zhen Lei

arXiv:2508.21048·cs.CV·March 2, 2026

TL;DR

This paper introduces Veritas, a novel multi-modal large language model-based deepfake detector that employs pattern-aware reasoning and is trained on a new challenging dataset, HydraFake, to improve generalization to unseen forgeries and domains.

Contribution

The paper presents Veritas, a deepfake detection method with pattern-aware reasoning and a two-stage training pipeline, addressing the gap between academic benchmarks and real-world scenarios.

Findings

01 Veritas outperforms previous detectors on HydraFake's cross-model and unseen forgery scenarios.

02 HydraFake provides a more realistic benchmark with diverse forgeries and domains.

03 Veritas offers transparent and faithful detection outputs.

Abstract

Deepfake detection remains a formidable challenge due to the complex and evolving nature of fake content in real-world scenarios. However, existing academic benchmarks suffer from severe discrepancies from industrial practice, typically featuring homogeneous training sources and low-quality testing images, which hinder the practical deployments of current detectors. To mitigate this gap, we introduce HydraFake, a dataset that simulates real-world challenges with hierarchical generalization testing. Specifically, HydraFake involves diversified deepfake techniques and in-the-wild forgeries, along with rigorous training and evaluation protocol, covering unseen model architectures, emerging forgery techniques and novel data domains. Building on this resource, we propose Veritas, a multi-modal large language model (MLLM) based deepfake detector. Different from vanilla chain-of-thought (CoT),…

Tables (10)

Table 1. Table 1: Performance comparison (Acc.) on the HydraFake dataset. In-domain (ID) results are averaged. To ensure fair comparisons with MLLM-based detectors, (1) we exclude the ID set from their average results and (2) we further restrict the training scope of our method to FF++, StyleGAN, StableDiffusion XL and FFHQ (similar to FFAA), yielding “Veritas-mini”. The best results are bolded and the second best are underlined. More metrics are given in Appendix A.5.

Method	ID	Cross-Model						Cross-Forgery						Cross-Domain						Avg.
		ADF	FLUX	StarryAI	MAGI-1	HART	Infinity	St.GAN2	ICLight	CodeF.	InfiniteY.	PuLID	FaceAda.	Deepface.	InfiniteY.	Dreamina	HailuoAI	GPT-4o	FFIW
Small Vision Models
F3Net (ECCV’20)	85.3	86.7	87.8	78.6	85.0	86.0	82.9	41.3	48.9	71.9	84.9	85.5	72.6	57.7	78.5	55.6	68.6	66.2	66.4	73.2
UniFD (CVPR’23)	82.7	90.7	93.8	82.5	73.0	94.4	90.7	61.8	81.9	75.4	73.7	68.1	81.3	67.4	67.3	80.5	75.2	73.3	67.5	78.0
IID (CVPR’23)	83.4	83.3	82.8	80.0	80.2	81.1	82.2	41.4	53.3	79.7	81.8	81.8	73.7	65.2	69.9	63.8	63.3	63.8	64.2	72.4
FreqNet (AAAI’24)	66.8	60.3	76.7	59.0	69.2	77.1	75.1	33.1	73.1	70.3	72.8	77.4	67.7	50.6	67.0	62.1	59.3	58.3	51.2	64.6
ProDet (NIPS’24)	90.5	92.6	94.2	88.2	91.9	93.8	93.1	56.3	58.6	80.8	88.1	91.0	83.3	58.1	82.9	71.3	75.6	66.3	74.1	80.6
NPR (CVPR’24)	75.6	68.8	91.2	59.5	82.6	91.3	84.0	47.7	67.8	60.6	79.8	89.0	67.7	52.6	73.0	76.6	62.3	50.2	46.0	69.8
AIDE (ICLR’25)	80.4	68.8	86.3	64.0	88.9	95.4	76.0	56.7	79.2	86.1	74.2	62.4	75.7	59.7	67.9	49.7	58.0	51.9	59.2	70.6
Co-SPY (CVPR’25)	86.3	93.5	95.5	85.3	93.3	96.6	95.3	77.0	92.5	88.6	90.6	79.1	87.3	67.6	80.0	82.5	74.0	79.5	64.3	84.7
D³ (CVPR’25)	87.3	93.6	95.6	91.3	90.7	95.8	95.5	62.4	71.6	82.9	80.0	82.4	73.7	69.7	74.6	78.1	70.9	80.8	64.3	81.1
Effort (ICML’25)	94.7	82.8	96.5	78.0	90.5	97.8	98.3	64.7	94.8	89.7	89.5	92.9	88.0	64.8	82.2	61.5	66.4	53.8	74.0	82.2
Generic MLLMs
Qwen2.5-VL-7B	51.2	50.0	50.0	49.7	50.0	52.0	52.9	50.5	56.7	50.7	53.6	54.5	51.6	50.7	53.6	80.2	67.5	52.5	50.5	54.1
InternVL3-8B	54.0	54.0	49.8	49.0	56.6	55.8	57.2	62.9	54.2	62.9	63.6	54.8	67.7	54.4	67.1	77.1	66.5	47.4	51.8	58.3
MiMo-VL-7B	63.8	74.5	77.1	82.5	60.3	82.4	81.4	48.7	82.6	76.4	79.7	78.4	82.8	57.7	75.6	79.4	70.7	67.7	54.9	72.5
GLM-4.1V-9BThink	56.4	55.2	52.3	50.5	51.6	68.4	60.7	54.3	68.4	63.3	65.7	55.1	81.0	58.7	72.7	83.7	69.2	52.0	53.9	61.7
GPT-4o	53.5	57.7	52.0	51.4	59.9	81.2	54.8	66.4	58.9	52.5	64.4	60.9	55.5	49.4	62.0	90.7	73.7	58.0	52.8	60.8
Gemini-2.5-Pro	72.2	64.9	92.4	82.8	62.5	93.4	93.2	73.7	83.3	87.4	85.5	84.7	85.6	67.2	75.6	87.5	82.4	70.9	53.0	78.9
MLLM-based Forgery Detectors
M2F2-Det (CVPR’25)	-	56.0	57.7	59.8	61.8	61.3	55.4	78.9	65.5	80.0	57.4	57.5	76.3	73.0	56.3	67.2	50.6	53.0	70.6	63.2
FakeShield (ICLR’25)	-	64.3	64.0	61.5	63.1	61.8	63.3	64.0	57.3	60.9	58.1	63.6	63.7	50.2	83.8	53.8	51.3	53.9	55.6	60.8
SIDA-7B (CVPR’25)	-	97.3	97.7	79.5	59.3	98.5	95.0	59.8	60.6	62.3	89.7	94.4	63.3	50.4	81.9	80.0	78.0	68.9	57.3	76.3
SIDA-13B (CVPR’25)	-	80.7	78.5	54.8	52.5	91.3	82.4	63.7	61.2	68.2	56.7	67.1	84.3	60.8	58.2	88.3	74.0	74.1	59.9	69.8
FFAA (Arxiv’24)	-	55.1	50.9	72.9	63.5	60.8	57.6	82.7	70.9	71.8	58.4	62.4	86.0	67.7	58.4	55.3	59.2	49.6	68.3	64.0
FakeVLM (NIPS’25)	-	78.2	78.5	77.0	74.5	76.5	76.8	70.8	76.2	76.2	76.9	76.5	77.7	75.7	83.6	81.5	80.8	78.7	74.5	77.3
Veritas-mini	-	95.5	99.1	97.3	72.8	97.0	96.1	82.5	76.3	90.0	83.7	82.9	79.3	72.5	78.7	92.0	93.0	85.5	70.6	85.8
Veritas (cold-start)	96.8	79.5	99.6	96.0	99.9	99.7	99.9	84.0	65.3	94.8	86.2	93.4	86.7	55.9	73.5	93.7	89.3	88.1	76.4	87.3
Veritas (ours)	97.3	94.8	99.8	97.0	99.9	99.9	99.9	90.3	75.7	97.0	91.8	95.1	91.7	58.6	84.1	92.3	90.2	89.2	78.5	90.7

Table 2. Table 9: Data list of the hierarchical evaluation protocol in HydraFake dataset.

Evaluation Split	Method	Sub-Type	Venue	Data Scale	Resolution
In-Domain	FaceForensics++	FS	ICCV’19	8,960	$256 \times 256$
	Facevid2vid	FR	Arxiv’19	2,000	$256 \times 256$
	Hallo2	FR	ICLR’25	1,660	$256 \times 256$
	StyleGAN	EFG	CVPR’19	600	$1024 \times 1024$
	Midjourney	EFG	None	600	$1024 \times 1024$
Cross-Model	Adobe Firefly	Proprietary	None	600	$1024 \times 1024$
	StarryAI	Proprietary	None	600	$1024 \times 1024$
	Flux1.1 Pro	Customized	None	600	$1024 \times 1024$
	MAGI-1	Video AR	None	1,048	$256 \times 256$
	HART	Image AR	Arxiv’24	4,200	$1024 \times 1024$
	Infinity	Image AR	CVPR’25	4,200	$1024 \times 1024$
Cross-Forgery	StarGANv2	Editing	CVPR’20	2,000	$256 \times 256$
	CodeFormer	Restoration	NIPS’22	1,750	$512 \times 512$
	IC-Light	Relighting	ICLR’25	2,082	$1536 \times 1536$
	FaceAdapter	Generative FS	ECCV’24	300	$1024 \times 1024$
	PuLID	Personalization	NIPS’24	3,360	$1024 \times 1024$
	InfiniteYou	Personalization	ICCV’25	3,244	$1024 \times 1024$
Cross-Domain	Hailuo AI	Commercial	None	1,000	$256 \sim 1536$
	Dreamina	Social media	None	952	$1024 \times 1024$
	GPT-4o	Social media	None	630	$159 \sim 1536$
	DeepFaceLab	Classic dataset	PR 2023	3,094	$256 \times 256$
	FFIW	Classic dataset	CVPR’21	6,832	$256 \times 256$
	InfiniteYou-CD	Personalization	ICCV’25	2,960	$1024 \times 1024$
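
The hierarchical protocol in the data list above can be written down as a small lookup structure. The following is a minimal sketch, not the authors' tooling: the names `HYDRAFAKE_EVAL` and `split_sizes` are hypothetical, and only the subset names and data scales are taken from the table.

```python
# Evaluation subsets and data scales copied from the data list above.
# The dictionary layout and all identifiers are illustrative assumptions.
HYDRAFAKE_EVAL = {
    "In-Domain": {
        "FaceForensics++": 8960, "Facevid2vid": 2000, "Hallo2": 1660,
        "StyleGAN": 600, "Midjourney": 600,
    },
    "Cross-Model": {
        "Adobe Firefly": 600, "StarryAI": 600, "Flux1.1 Pro": 600,
        "MAGI-1": 1048, "HART": 4200, "Infinity": 4200,
    },
    "Cross-Forgery": {
        "StarGANv2": 2000, "CodeFormer": 1750, "IC-Light": 2082,
        "FaceAdapter": 300, "PuLID": 3360, "InfiniteYou": 3244,
    },
    "Cross-Domain": {
        "Hailuo AI": 1000, "Dreamina": 952, "GPT-4o": 630,
        "DeepFaceLab": 3094, "FFIW": 6832, "InfiniteYou-CD": 2960,
    },
}


def split_sizes(protocol):
    """Return the total number of listed samples per evaluation split."""
    return {split: sum(subsets.values()) for split, subsets in protocol.items()}


if __name__ == "__main__":
    for split, total in split_sizes(HYDRAFAKE_EVAL).items():
        print(f"{split}: {total} samples across {len(HYDRAFAKE_EVAL[split])} subsets")
```

Keeping the split as the outer key makes it straightforward to report results per generalization level (ID, CM, CF, CD) rather than as one pooled number.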

Table 3. Table 10: Performance comparison on the In-Domain (ID) subset of the HydraFake dataset. The best results are bolded and the second best are underlined. We report Accuracy (Acc.), Precision (P.) and Recall (R.); the averaged results (Avg.) are reported in Accuracy.

Method	FaceForensics++			Facevid2vid			Hallo2			StyleGAN			Midjourney			Avg.
	Acc.	P.	R.	Acc.	P.	R.	Acc.	P.	R.	Acc.	P.	R.	Acc.	P.	R.
F3Net (ECCV’20)	84.5	80.5	90.9	89.6	83.6	98.6	85.0	78.3	97.0	82.5	77.4	91.6	84.8	78.6	95.6	85.3
UniFD (CVPR’23)	73.8	80.4	62.8	81.7	83.2	79.6	77.3	88.8	62.5	94.3	91.5	97.6	86.3	90.6	81.0	82.7
IID (CVPR’23)	83.8	86.6	79.9	92.8	89.5	97.0	79.1	72.2	94.6	80.7	72.1	100.0	80.5	71.9	100.0	83.4
FreqNet (AAAI’24)	52.2	51.2	94.4	54.5	52.3	100.0	71.2	66.7	84.5	77.5	69.8	96.6	78.8	70.8	98.0	66.8
ProDet (NIPS’24)	84.8	82.2	88.8	90.9	84.6	100.0	91.4	89.0	94.4	92.5	88.7	97.3	93.0	88.4	99.0	90.5
NPR (CVPR’24)	59.6	56.5	84.1	66.6	60.5	95.7	74.3	87.2	57.0	88.8	91.7	85.3	88.7	92.9	83.6	75.6
AIDE (ICLR’25)	58.9	58.7	59.6	78.6	70.3	99.0	84.5	92.4	75.2	86.8	93.1	99.0	93.0	92.7	93.3	80.4
Co-SPY (CVPR’25)	71.3	76.0	62.3	89.3	82.9	99.1	82.1	91.6	70.7	95.0	92.2	98.3	94.0	94.0	94.0	86.3
D³ (CVPR’25)	72.5	70.3	77.8	82.1	74.7	96.9	91.1	91.7	90.3	96.3	94.2	98.6	94.7	91.3	98.6	87.3
Effort (ICML’25)	92.9	94.4	91.2	96.4	94.6	98.4	95.9	95.3	96.5	95.0	97.5	92.3	93.5	96.1	90.6	94.7
Qwen2.5-VL-7B	51.3	93.5	2.8	51.5	94.3	3.3	50.0	50.0	1.1	51.5	80.0	4.0	51.5	76.4	4.3	51.2
InternVL3-8B	55.9	55.5	59.0	53.7	53.6	54.6	56.3	83.8	15.6	52.0	66.6	8.0	52.3	69.4	8.3	54.0
MiMo-VL-7B	56.2	55.3	64.0	62.2	59.4	77.1	60.4	63.9	47.6	66.4	69.3	58.6	73.6	72.0	77.3	63.8
GLM-4.1V-9BThink	58.0	64.0	36.3	59.6	65.6	40.0	54.2	89.6	9.4	56.0	87.5	14.0	54.3	80.9	11.3	56.4
GPT-4o	49.2	20.0	0.5	53.0	84.2	8.0	49.8	33.3	0.5	63.2	100.0	26.5	52.5	91.6	5.5	53.5
Gemini-2.5-Pro	62.0	77.9	33.5	53.2	66.6	13.0	66.0	80.9	42.5	93.9	89.3	100.0	85.8	73.7	76.6	72.2
Veritas (ours)	90.9	90.2	91.7	96.1	92.7	100.0	99.8	99.6	100.0	99.8	99.6	100.0	100.0	100.0	100.0	97.3
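
The three reported metrics explain why several generic MLLMs sit near 50% accuracy while showing high precision and very low recall: they rarely flag anything as fake. The sketch below illustrates this with made-up confusion counts. It assumes that "fake" is the positive class and that the test set is balanced, neither of which is stated on this page, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion counts, treating 'fake' as the positive class (an assumption)."""
    tp: int  # fakes flagged as fake
    fp: int  # real images flagged as fake
    tn: int  # real images accepted as real
    fn: int  # fakes accepted as real


def accuracy(c: Confusion) -> float:
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


def precision(c: Confusion) -> float:
    # Convention: 0.0 when the detector never predicts 'fake'.
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def recall(c: Confusion) -> float:
    fakes = c.tp + c.fn
    return c.tp / fakes if fakes else 0.0


# Hypothetical detector on 1,000 real + 1,000 fake images that almost
# always answers "real": it flags only 30 images, 28 of them correctly.
lazy = Confusion(tp=28, fp=2, tn=998, fn=972)

if __name__ == "__main__":
    print(f"Acc. {accuracy(lazy):.3f}  P. {precision(lazy):.3f}  R. {recall(lazy):.3f}")
```

Under these assumptions, accuracy stays just above chance even though almost every fake is missed, which is why precision and recall are reported alongside accuracy.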

Table 4. Table 11: Performance comparison on the Cross-Model (CM) subset of the HydraFake dataset. The best results are bolded and the second best are underlined. We report Accuracy (Acc.), Precision (P.) and Recall (R.); the averaged results (Avg.) are reported in Accuracy.

Method	Adobe Firefly			FLUX1.1Pro			StarryAI			MAGI-1			HART (VAR)			Infinity (VAR)			Avg.
	Acc.	P.	R.	Acc.	P.	R.	Acc.	P.	R.	Acc.	P.	R.	Acc.	P.	R.	Acc.	P.	R.
F3Net (ECCV’20)	86.7	80.0	97.7	87.8	80.4	100.0	78.6	76.2	83.3	85.0	78.6	96.2	86.0	78.6	99.0	82.9	78.6	90.4	84.5
UniFD (CVPR’23)	90.7	92.9	88.0	93.8	91.7	96.3	82.5	88.8	74.3	73.0	84.7	56.1	94.4	92.2	97.0	90.7	92.8	88.2	87.5
IID (CVPR’23)	83.3	75.0	100.0	82.8	74.4	100.0	80.0	72.1	97.6	80.2	72.6	96.5	81.1	72.6	100.0	82.2	73.7	100.0	81.6
FreqNet (AAAI’24)	60.3	59.5	64.3	76.7	68.4	99.0	59.0	59.1	58.3	69.2	65.5	81.1	77.1	69.4	96.7	75.1	68.7	92.3	69.6
ProDet (NIPS’24)	92.6	88.3	98.3	94.2	89.5	100.0	88.2	88.0	88.3	91.9	87.5	97.7	93.8	89.0	99.8	93.1	88.7	98.6	92.3
NPR (CVPR’24)	68.8	86.0	45.0	91.2	91.5	90.6	59.5	78.8	26.0	82.6	91.9	71.5	91.3	93.1	89.3	84.0	92.6	73.8	79.6
AIDE (ICLR’25)	68.8	85.5	43.3	86.3	91.0	80.6	64.0	86.6	33.0	88.9	93.6	83.6	95.4	94.3	96.6	76.0	91.0	57.7	79.9
Co-SPY (CVPR’25)	93.5	93.6	93.3	95.5	94.1	97.0	85.3	93.4	76.0	93.3	92.2	72.7	96.6	94.5	98.9	95.3	94.1	96.7	93.3
D³ (CVPR’25)	93.6	90.3	97.7	95.6	92.3	99.6	91.3	92.2	90.3	90.7	90.5	91.0	95.8	92.2	99.9	95.5	93.1	98.3	93.8
Effort (ICML’25)	82.8	97.1	67.6	96.5	96.0	97.0	78.0	94.2	59.6	90.5	96.5	84.1	97.8	97.2	98.5	98.3	97.3	99.4	90.7
Qwen2.5-VL-7B	50.0	50.0	1.3	50.0	50.0	1.3	49.7	0.0	0.0	50.0	50.0	0.9	52.0	83.8	4.9	52.9	87.3	6.8	50.8
InternVL3-8B	54.0	74.0	12.3	49.8	47.3	3.0	49.0	28.5	1.3	56.6	83.5	16.4	55.8	83.1	14.5	57.2	84.5	17.6	53.7
MiMo-VL-7B	74.5	74.7	74.0	77.1	76.7	78.0	82.5	68.0	55.3	60.3	64.0	47.1	82.4	77.9	90.6	81.4	77.2	89.2	76.4
GLM-4.1V-9BThink	55.2	82.9	13.0	52.3	76.9	6.6	50.5	66.6	2.0	51.6	81.5	4.2	68.4	96.1	38.2	60.7	95.5	22.5	56.5
GPT-4o	57.7	96.9	16.0	52.0	83.3	5.0	51.4	85.7	3.0	59.9	80.0	26.4	81.2	92.5	67.9	54.8	87.5	10.6	59.5
Gemini-2.5-Pro	64.9	78.8	41.2	92.4	90.3	94.9	82.8	92.3	72.0	62.5	79.0	34.1	93.4	88.5	100.0	93.2	89.1	98.5	81.5
Veritas (ours)	94.8	99.6	89.9	99.8	99.6	100.0	97.0	100.0	94.0	99.9	99.8	100.0	99.9	99.8	100.0	99.9	99.8	99.9	98.6

Table 5. Table 12: Performance comparison on the Cross-Forgery (CF) subset of the HydraFake dataset. The best results are bolded and the second best are underlined. We report Accuracy (Acc.), Precision (P.) and Recall (R.); the averaged results (Avg.) are reported in Accuracy.

Method	StarGANv2			IC-Light			CodeFormer			InfiniteYou			PuLID			FaceAdapter			Avg.
	Acc.	P.	R.	Acc.	P.	R.	Acc.	P.	R.	Acc.	P.	R.	Acc.	P.	R.	Acc.	P.	R.
F3Net (ECCV’20)	41.3	20.7	6.2	48.9	47.9	24.8	71.9	72.8	69.8	84.9	78.2	96.8	85.5	79.5	95.7	72.6	77.5	62.8	67.5
UniFD (CVPR’23)	61.8	81.7	30.4	81.9	89.1	72.6	75.4	88.0	58.7	73.7	87.0	55.6	68.1	83.6	45.1	81.3	90.3	69.6	73.7
IID (CVPR’23)	41.4	32.8	16.6	53.3	54.3	41.0	79.7	72.9	94.4	81.8	73.4	99.6	81.8	73.3	99.9	73.7	68.2	87.1	68.6
FreqNet (AAAI’24)	33.1	11.4	5.0	73.1	67.6	88.9	70.3	66.6	81.5	72.8	67.4	88.3	77.4	69.9	96.2	67.7	64.9	75.0	65.7
ProDet (NIPS’24)	56.3	67.5	24.5	58.6	69.8	34.4	80.8	84.7	75.3	88.1	89.9	85.8	91.0	88.1	94.9	83.3	86.0	79.0	76.4
NPR (CVPR’24)	47.7	23.6	2.1	67.8	85.2	43.2	60.6	78.3	29.3	79.8	99.1	66.0	89.0	92.8	84.5	67.7	84.9	41.9	68.8
AIDE (ICLR’25)	56.7	74.6	20.3	79.2	91.5	64.3	86.1	90.8	80.2	74.2	89.4	55.0	62.4	83.9	30.5	75.7	90.3	56.7	72.4
Co-SPY (CVPR’25)	77.0	91.1	59.9	92.5	93.8	91.0	88.6	90.3	86.5	90.6	93.2	87.5	79.1	89.9	65.5	87.3	88.7	85.4	85.9
D³ (CVPR’25)	62.4	82.4	31.5	71.6	86.3	51.4	82.9	90.9	73.1	80.0	88.7	68.7	82.4	89.8	73.1	73.7	84.1	57.4	75.5
Effort (ICML’25)	64.7	93.2	31.7	94.8	96.3	93.2	89.7	97.1	81.9	89.5	96.0	82.3	92.9	96.2	89.2	88.0	95.0	80.2	86.6
Qwen2.5-VL-7B	50.5	64.7	2.2	56.7	90.8	15.2	50.7	70.0	2.4	53.6	90.3	8.1	54.5	90.4	10.1	51.6	63.6	4.7	52.9
InternVL3-8B	62.9	87.7	30.0	54.2	77.1	12.0	62.9	86.9	30.5	63.6	92.1	29.7	54.8	80.1	12.7	67.7	96.3	35.8	61.0
MiMo-VL-7B	48.7	47.7	26.2	82.6	76.4	94.3	76.4	75.1	78.9	79.7	76.3	85.7	78.4	75.8	83.5	82.8	79.1	88.3	74.8
GLM-4.1V-9BThink	54.3	89.8	9.7	68.4	94.7	39.0	63.3	95.7	27.8	65.7	96.5	32.6	55.1	87.4	11.9	81.0	97.9	62.8	64.6
GPT-4o	66.4	82.6	41.5	58.9	100.0	18.0	52.5	91.6	5.5	64.4	100.0	29.0	60.9	94.0	23.5	55.5	100.0	10.1	59.8
Gemini-2.5-Pro	73.7	89.7	53.5	83.3	89.7	75.1	87.4	86.0	89.3	85.5	86.2	84.5	84.7	88.4	80.0	85.6	85.3	85.8	83.4
Veritas (ours)	90.3	99.5	81.0	75.7	99.5	51.5	97.0	98.6	95.3	91.8	98.9	84.5	95.1	99.5	90.6	91.7	98.6	84.6	90.3

Table 6. Table 13: Performance comparison on the Cross-Domain (CD) subset of the HydraFake dataset. The best results are bolded and the second best are underlined. We report Accuracy (Acc.), Precision (P.) and Recall (R.); the averaged results (Avg.) are reported in Accuracy.

Method	DeepFaceLab			InfiniteYou-CD			Dreamina			Hailuo AI			GPT-4o			FFIW			Avg.
	Acc.	P.	R.	Acc.	P.	R.	Acc.	P.	R.	Acc.	P.	R.	Acc.	P.	R.	Acc.	P.	R.
F3Net (ECCV’20)	57.7	54.7	89.1	78.5	72.6	91.6	55.6	55.0	60.9	68.6	63.3	88.6	66.2	63.6	75.5	66.4	64.8	71.4	65.5
UniFD (CVPR’23)	67.4	64.1	79.2	67.3	75.9	50.5	80.5	77.2	86.3	75.2	75.4	74.8	73.3	73.6	72.7	67.5	65.4	74.5	71.9
IID (CVPR’23)	65.2	65.1	65.5	69.9	63.0	95.9	63.8	58.0	99.6	63.3	57.6	100.0	63.8	58.2	97.8	64.2	66.6	56.6	65.0
FreqNet (AAAI’24)	50.6	50.3	98.7	67.0	62.3	85.8	62.1	58.4	83.4	59.3	56.9	76.8	58.3	56.3	73.3	51.2	50.6	95.0	58.1
ProDet (NIPS’24)	58.1	54.6	95.3	82.9	81.3	85.4	71.3	71.3	71.4	75.6	71.2	86.0	66.3	69.9	57.4	74.1	73.0	76.5	71.4
NPR (CVPR’24)	52.6	52.4	56.6	73.0	82.1	58.7	76.6	80.9	69.5	62.3	70.7	42.0	50.2	50.4	20.9	46.0	46.3	50.9	60.1
AIDE (ICLR’25)	59.7	57.5	73.9	67.9	75.2	53.2	49.7	49.4	25.2	58.0	60.6	45.8	51.9	53.0	33.3	59.2	57.3	71.2	57.7
Co-SPY (CVPR’25)	67.6	63.8	81.2	80.0	85.1	72.7	82.5	80.7	85.3	74.0	75.7	70.6	79.5	78.7	80.9	64.3	64.2	64.7	74.7
D³ (CVPR’25)	69.7	70.0	69.1	74.6	78.8	67.2	78.1	72.2	91.4	70.9	67.7	79.8	80.8	75.1	92.0	64.3	66.0	58.7	73.1
Effort (ICML’25)	64.8	59.1	96.3	82.2	75.4	95.5	61.5	57.6	86.5	66.4	60.1	97.4	53.8	52.9	69.3	74.0	68.3	89.7	67.1
Qwen2.5-VL-7B	50.7	71.7	2.4	53.6	92.8	7.9	80.2	99.4	60.7	67.5	100.0	35.1	52.5	100.0	5.1	50.5	56.1	4.5	59.2
InternVL3-8B	54.4	55.5	44.1	67.1	83.6	42.7	77.1	85.4	65.3	66.5	78.9	45.0	47.4	37.5	7.6	51.8	52.0	45.1	60.7
MiMo-VL-7B	57.7	56.4	68.6	75.6	73.7	79.6	79.4	73.5	91.9	70.7	68.3	76.5	67.7	67.1	69.5	54.9	54.8	56.2	67.7
GLM-4.1V-9BThink	58.7	82.6	22.0	72.7	95.1	47.8	83.7	96.0	70.3	69.2	92.1	42.0	52.0	68.4	8.2	53.9	60.8	21.7	65.0
GPT-4o	49.4	41.7	2.5	62.0	98.0	24.5	90.7	98.8	82.4	73.7	100.0	47.5	58.0	94.4	17.0	52.8	59.3	17.8	64.4
Gemini-2.5-Pro	67.2	72.4	55.6	75.6	88.7	59.0	87.5	89.5	84.9	82.4	95.2	68.2	70.9	96.6	43.2	53.0	67.6	11.5	72.8
Veritas (ours)	58.6	54.7	100.0	84.1	94.1	72.8	92.3	90.0	95.1	90.2	86.5	95.2	89.2	86.3	93.2	78.5	76.1	83.0	82.2
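
The "Avg." column in these per-split tables appears to be an unweighted mean of the per-subset accuracies. This is an inference from the numbers, not something the captions state. A short check on two rows of the Cross-Domain table, with a hypothetical helper name:

```python
from statistics import fmean


def macro_average(accuracies):
    """Unweighted mean over subsets, ignoring how many samples each contains."""
    return fmean(accuracies)


# Per-subset accuracies copied from the Cross-Domain table above, in column order:
# DeepFaceLab, InfiniteYou-CD, Dreamina, Hailuo AI, GPT-4o, FFIW.
veritas_cd = [58.6, 84.1, 92.3, 90.2, 89.2, 78.5]
f3net_cd = [57.7, 78.5, 55.6, 68.6, 66.2, 66.4]

if __name__ == "__main__":
    print(f"Veritas CD average: {macro_average(veritas_cd):.2f}")
    print(f"F3Net CD average:   {macro_average(f3net_cd):.2f}")
```

Because the mean is unweighted, a small subset such as GPT-4o (630 samples) counts as much as FFIW (6,832 samples) in the reported average.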

Table 7. Table 14: Cross-benchmark comparison. Performance (Acc.) on AIGIBench (Li et al., 2025) and the HydraFake-CD set. For previous methods, we implement two settings: (1) training on FF++ (Rossler et al., 2019), following the previous setting, and (2) training on the HydraFake dataset, which contains multiple sources. The number of training samples for FF++ and HydraFake is kept consistent. The performance of recent methods increases when trained on more diverse sources (highlighted in gray), while similar gains are not observed for deepfake detection methods (highlighted in blue).

Method	Training	HydraFake-CD						Avg.	AIGIBench								Avg.
		Deepface.	InfiniteY.	Dreamina	Hailuo AI	GPT-4o	FFIW		BLIP	E4S	InfiniteID	InSwap	IPAdapter	R3GAN	StyleSwim	WFIR
F3Net (ECCV’20)	FF++	52.9	56.4	63.8	76.0	69.5	52.6	61.9	65.6	44.1	51.8	52.5	72.7	51.1	47.1	46.1	53.9
UniFD (CVPR’23)	FF++	72.9	54.3	63.0	59.2	57.6	66.9	62.3	51.3	80.7	77.1	63.5	67.3	67.1	78.3	71.1	69.6
IID (CVPR’23)	FF++	51.1	61.8	81.1	83.3	75.6	61.2	69.0	64.6	39.2	80.8	44.7	87.3	52.6	48.0	45.6	57.9
ProDet (NIPS’24)	FF++	63.2	64.8	85.6	83.8	70.8	63.1	71.9	65.2	46.6	84.4	55.1	81.4	51.1	49.3	48.2	60.2
Co-SPY (CVPR’25)	FF++	67.6	82.0	82.4	74.0	64.3	64.3	72.4	78.2	86.2	80.0	82.6	55.5	76.3	86.3	92.8	79.7
D³ (CVPR’25)	FF++	71.8	65.4	59.6	44.3	48.4	62.9	58.7	48.7	80.3	60.0	68.9	55.5	60.3	66.3	59.4	62.4
Effort (ICML’25)	FF++	76.2	76.9	55.3	49.0	49.7	70.7	63.0	83.8	85.9	82.2	77.3	71.3	81.4	82.3	83.1	80.9
F3Net (ECCV’20)	HydraFake	57.7	78.5	55.6	68.6	66.2	66.4	65.5	72.1	77.7	69.8	80.4	62.9	53.5	44.3	74.5	66.9
UniFD (CVPR’23)	HydraFake	67.4	67.3	80.5	75.2	73.3	67.5	71.9	60.2	90.4	64.9	76.1	80.3	82.3	78.2	91.2	78.0
IID (CVPR’23)	HydraFake	65.2	69.9	63.8	63.3	63.8	64.2	65.0	78.3	65.8	76.0	62.7	77.8	58.3	48.2	81.3	68.6
ProDet (NIPS’24)	HydraFake	58.1	82.9	71.3	75.6	66.3	74.1	71.4	82.4	85.9	81.1	86.7	71.7	54.3	47.7	88.4	74.8
Co-SPY (CVPR’25)	HydraFake	67.6	80.0	82.5	74.0	79.5	64.3	74.7	77.6	88.1	89.2	81.3	57.4	78.7	86.2	94.2	81.6
D³ (CVPR’25)	HydraFake	69.7	74.6	78.1	70.9	80.8	64.3	73.1	71.4	87.8	80.9	78.9	68.6	77.9	63.7	93.4	77.8
Effort (ICML’25)	HydraFake	64.8	82.2	61.5	66.4	53.8	74.0	67.1	82.1	89.1	84.0	85.8	84.8	87.1	82.3	82.7	84.7
Veritas (ours)	HydraFake	58.6	84.1	92.3	90.2	89.2	78.5	82.2	81.9	93.5	88.4	89.3	81.8	91.3	85.2	99.8	88.9

Table 8. Table 15: Performance comparison on broader benchmarks, including LOKI (Ye et al., 2024), FakeClue (Wen et al., 2025c), Forensics-Bench (Wang et al., 2025b), AIGIBench (Li et al., 2025) and Nano-banana-150K (Ye et al., 2025b). Results on the facial data in LOKI are also reported, since we target deepfake detection.

Method	LOKI		LOKI (facial)		FakeClue		Forensics-Bench		AIGIBench		Nano-banana
	Acc.	F1	Acc.	F1	Acc.	F1	Acc.	F1	Acc.	F1	Acc.	F1
UniFD (CVPR’23)	54.5	58.7	74.8	69.7	61.6	64.2	53.6	54.8	78.0	80.2	49.1	36.0
ProDet (NIPS’24)	53.8	56.6	63.2	66.4	62.9	69.8	65.1	72.0	74.8	73.9	64.6	63.8
Co-SPY (CVPR’25)	61.7	65.8	79.1	75.6	68.1	72.4	70.8	76.0	81.6	84.3	52.0	40.9
D³ (CVPR’25)	47.3	41.2	79.5	80.1	60.7	59.2	56.6	59.8	77.8	75.3	70.7	73.0
Effort (ICML’25)	53.8	50.0	84.3	84.6	65.0	63.2	57.0	59.4	84.7	87.6	62.7	52.1
InternVL3-8B	53.0	51.6	52.6	15.3	59.1	62.2	60.5	65.4	55.6	56.5	51.3	38.9
MiMo-VL-7B	65.1	64.3	69.7	65.0	67.2	71.6	63.8	70.5	62.8	64.3	60.7	55.6
Veritas (ours)	72.1	77.8	89.0	88.2	85.9	88.4	70.8	74.9	88.9	90.4	86.3	89.0

Table 9. Table 16: Analysis of efficiency. We calculate the inference time (seconds) of a batch of images with the batch size set to 8. The experiments are conducted on a single PPUE GPU with the Transformers library. We select samples of different resolutions and difficulty for illustration.

Model	Low Resolution ( $↓$ )		High Resolution ( $↓$ )		Acc. ( $↑$ )
	Easy	Hard	Easy	Hard
Post-hoc Explanation	27.08	28.19	37.64	36.06	83.4
Flexible Reasoning	24.60	29.37	44.43	47.44	86.8
Ours (cold-start)	19.26	24.94	40.72	46.40	89.3
Ours	22.35	25.60	49.76	54.22	92.1
$𝚫$ Post-hoc Exp.	$↓$ 4.73	$↓$ 2.59	$↑$ 12.12	$↑$ 18.16	$↑$ 8.7
$𝚫$ Flexible Reason.	$↓$ 2.25	$↓$ 3.77	$↑$ 5.33	$↑$ 6.78	$↑$ 5.3

Table 10. Table 17: Ablation studies of the hyperparameter $β^{'}$ (strength of the KL penalty in P-GRPO).

Value of $β^{'}$	ID	CM	CF	CD
$β^{'} = 0.04$	96.8	98.4	89.1	80.2
$β^{'} = 0.01$	96.8	98.3	89.6	81.5
$β^{'} = 0.001$	97.3	98.6	89.3	81.9
$β^{'} = 0.0$	97.3	98.6	90.3	82.2

Equations

$$L_{1} = -\mathbb{E}_{(q, s) \sim D_{1}} \sum_{t=1}^{T} \log \pi_{\theta}(s_{t} \mid q, s_{<t}),$$

$$L_{2} = -\mathbb{E}_{(q, s_{w}, s_{l}) \sim D_{2}} \left[ \log \sigma \left( \beta \log \frac{\pi_{\theta}(s_{w} \mid q)}{\pi_{\theta_{SFT}}(s_{w} \mid q)} - \beta \log \frac{\pi_{\theta}(s_{l} \mid q)}{\pi_{\theta_{SFT}}(s_{l} \mid q)} \right) \right],$$
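
For a single preference pair, the $L_{2}$ term can be evaluated directly from sequence-level log-probabilities. The sketch below is illustrative only: the function names and the default `beta=0.1` are assumptions, and the inputs stand for $\log \pi(s \mid q)$ under the trained policy and under the frozen SFT reference.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_loss(logp_w: float, logp_l: float,
                    ref_logp_w: float, ref_logp_l: float,
                    beta: float = 0.1) -> float:
    """Per-pair value of the L_2 objective.

    logp_w / logp_l: policy log-probabilities of the preferred (s_w) and
    rejected (s_l) responses; ref_*: the same under the SFT reference.
    beta=0.1 is a placeholder, not a value taken from the paper.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)


if __name__ == "__main__":
    # Policy identical to the reference: the margin is zero.
    print(preference_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy has shifted probability toward the preferred response.
    print(preference_loss(-4.0, -8.0, -5.0, -7.0))
```

When the policy matches the reference the loss equals $\log 2$, and it falls as the policy raises the preferred response relative to the rejected one.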

$$L_{3} = \frac{1}{\sum_{i=1}^{G} |o_{i}|} \sum_{i=1}^{G} \sum_{t=1}^{|o_{i}|} \left[ \min \left( r_{i,t}(\theta) A_{i,t}, \ \mathrm{clip}(r_{i,t}(\theta), 1 - \epsilon, 1 + \epsilon) A_{i,t} \right) - \beta' D_{KL} [\pi_{\theta} \,\|\, \pi_{\theta_{cold}}] \right],$$

$$r_{i,t}(\theta) = \frac{\pi_{\theta}(o_{i,t} \mid I, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid I, o_{i,<t})}, \qquad A_{i,t} = \frac{R_{i} - \mathrm{mean}(\{R_{1}, \dots, R_{G}\})}{\mathrm{std}(\{R_{1}, \dots, R_{G}\})}.$$
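
The group-relative advantage and the clipped term inside $L_{3}$ can be sketched in a few lines. This is a hedged illustration rather than the authors' implementation: the use of the population standard deviation, the small `eps` guard against identical rewards, and the default `clip_eps=0.2` are all assumptions.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize the rewards R_1..R_G of one sampled group.

    Every token of response i shares the same advantage A_{i,t}.
    eps is an added safeguard for groups whose rewards are all equal.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio, advantage, clip_eps=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single token."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    advantages = group_advantages([2.0, 1.0, 0.0, -1.0])
    print([round(a, 3) for a in advantages])
    print(clipped_term(1.5, 1.0), clipped_term(0.5, -1.0))
```

Clipping caps how much a single update can gain from pushing the probability ratio far from 1, in either direction.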

$$R_{pattern} = \begin{cases} 2.0, & \text{if } C = 1 \land (P = 1 \lor R = 1), \\ 1.0, & \text{if } C = 1 \land P = 0 \land R = 0, \\ 0.0, & \text{if } C = 0 \land P = 0 \land R = 0, \\ -0.5, & \text{if } C = 0 \land P = 1 \land R = 0, \\ -1.0, & \text{if } C = 0 \land R = 1. \end{cases}$$

$$R = R_{pattern} + \lambda_{1} R_{ref} \cdot \mathbb{I}(C = 1) + \lambda_{2} R_{fmt}.$$
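
The reward rules can be read as a small decision table. The sketch below assumes that $C$ marks a correct final answer, $P$ the presence of planning and $R$ the presence of self-reflection, and that the two incorrect-answer cases involving planning or reflection carry negative rewards (-0.5 and -1.0). The weights `lam1` and `lam2` and all function names are hypothetical, since this page gives no values for $\lambda_{1}$ and $\lambda_{2}$.

```python
def pattern_reward(correct: bool, planning: bool, reflection: bool) -> float:
    """R_pattern under the assumed reading of C, P and R."""
    if correct:
        # Correct answers earn more when planning or reflection was used.
        return 2.0 if (planning or reflection) else 1.0
    if reflection:
        # Wrong answer despite self-reflection: largest penalty.
        return -1.0
    if planning:
        return -0.5
    return 0.0


def total_reward(correct: bool, planning: bool, reflection: bool,
                 r_ref: float, r_fmt: float,
                 lam1: float = 1.0, lam2: float = 1.0) -> float:
    """R = R_pattern + lam1 * R_ref * I(C = 1) + lam2 * R_fmt.

    lam1 and lam2 are placeholder weights, not values from the paper.
    """
    indicator = 1.0 if correct else 0.0
    return (pattern_reward(correct, planning, reflection)
            + lam1 * r_ref * indicator
            + lam2 * r_fmt)


if __name__ == "__main__":
    print(total_reward(True, False, True, r_ref=0.5, r_fmt=1.0, lam1=0.5, lam2=0.1))
    print(total_reward(False, True, True, r_ref=0.5, r_fmt=1.0, lam1=0.5, lam2=0.1))
```

Under this reading, the extra patterns are rewarded only when they lead to a correct answer, and the $R_{ref}$ term is switched off entirely for wrong answers by the indicator.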

$$P(s \mid q) = \prod_{t=1}^{A} P(s_{t} \mid q) \cdot \prod_{t=A+1}^{A+E} P(s_{t} \mid q, s_{<A+E}),$$

$$P(s \mid q) = \prod_{t=1}^{R} P(s_{t} \mid q, s_{<R}) \cdot \prod_{t=R+1}^{R+A} P(s_{t} \mid q, s_{<R+A}),$$

Peer Reviews

Decision·ICLR 2026 Oral

Reviewer 01 · Rating 6 · Confidence 4

Strengths

a. The proposed dataset spans diverse domains and sources of real and manipulated images, including generative face-swapping, visual autoregressive models, and deepfakes collected from social media.
b. The two-stage training pipeline substantially outperforms existing deepfake detection methods, as demonstrated in Table 1.

Weaknesses

a. The proposed method employs SFT and GRPO within an MLLM-based deepfake detection framework, which is an established post-training strategy.
b. The difference between the proposed pattern "<fast><planning><reasoning><conclusion>" and the commonly used "<think>... </think>" paradigm has not been analyzed.

Reviewer 02 · Rating 8 · Confidence 3

Strengths

- The proposed HydraFake dataset addresses a crucial gap between academic benchmarks and industrial deployment scenarios, making it valuable for real-world applications.
- The authors have conducted comprehensive experiments, including comparisons with SOTA methods and detailed ablation studies, verifying the effectiveness of the proposed VERITAS model.
- The pattern-aware reasoning approach is reasonable, drawing inspiration from human cognitive processes to create more interpretable and robust

Weaknesses

- The two-stage training pipeline with MiPO and P-GRPO, though modified for forgery detection tasks, still seems to be a direct application of vanilla DPO and GRPO methods, which may undercut its novelty.
- The paper evaluates VERITAS exclusively on the proposed HydraFake dataset, raising concerns about overfitting to their specific evaluation protocol. It would strengthen claims about VERITAS's superiority and provide more convincing evidence of its effectiveness if authors could perform evalua

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- The proposed dataset is well-motivated. The division of four evaluation levels (i.e., in-domain, cross-model, cross-forgery and cross-domain) is reasonable. A fine-grained evaluation protocol is critical and reasonable at the moment, and the constructed dataset is of high quality, providing a challenging evaluation suite for the community.
- The proposed pattern-aware reasoning is effective and insightful compared to previous explainable methods. Experiments clearly show its superiority compar

Weaknesses

- The authors should provide some failure cases to understand the model’s limitations.
- More fine-grained ablations on the reasoning patterns could be done, e.g., what if removing the “reflection”/“planning” pattern?
- How "reflection" improves model's generalization capability to unseen forgeries? The author should provide more explanations to it.
- The human has very good reasoning capabilities. Why even human cannot accurately detect some (realistic) deepfakes? Is semantic-level reasoning ca

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Anomaly Detection Techniques and Applications · Adversarial Robustness in Machine Learning

Full text

Veritas: Generalizable Deepfake Detection via Pattern-Aware Reasoning

Hao Tan1,2,3 Jun Lan3† Zichang Tan4 Senyuan Shi2 Ajian Liu2

Chuanbiao Song3 Huijia Zhu3 Weiqiang Wang3 Jun Wan1,2,5§ Zhen Lei1,2,5

1School of Advanced Interdisciplinary Sciences (SAIS), University of Chinese Academy of Sciences

2MAIS, Institute of Automation, Chinese Academy of Sciences 3Ant Group

4Shenzhen Institute of Advanced Technology (SIAT), Chinese Academy of Sciences

5School of Artificial Intelligence, University of Chinese Academy of Sciences

{tanhao2023, jun.wan, zhen.lei}@ia.ac.cn [email protected]

†Project Lead. §Corresponding author.

Project Page: https://github.com/EricTan7/Veritas

This work was done during the first author’s internship at Ant Group.

Abstract

Deepfake detection remains a formidable challenge due to the evolving nature of fake content in real-world scenarios. However, existing benchmarks diverge severely from industrial practice: they typically feature homogeneous training sources and low-quality testing images, which hinders the practical use of current detectors. To mitigate this gap, we introduce HydraFake, a dataset that contains diversified deepfake techniques and in-the-wild forgeries, along with a rigorous training and evaluation protocol covering unseen model architectures, emerging forgery techniques and novel data domains. Building on this resource, we propose Veritas, a multi-modal large language model (MLLM) based deepfake detector. Different from vanilla chain-of-thought (CoT), we introduce pattern-aware reasoning that involves critical patterns such as “planning” and “self-reflection” to emulate the human forensic process. We further propose a two-stage training pipeline to seamlessly internalize such deepfake reasoning capacities into current MLLMs. Experiments on the HydraFake dataset reveal that although previous detectors generalize well in cross-model scenarios, they fall short on unseen forgeries and data domains. Our Veritas achieves significant gains across different out-of-domain (OOD) scenarios and is capable of delivering transparent and faithful detection outputs.

1 Introduction

Recent advances in Generative AI (Esser et al., 2024; Tian et al., 2024) have revolutionized our digital life, unprecedentedly enriching the diversity of content on social media and short-video platforms. Though bringing immense creativity, such techniques also enable highly convincing deepfakes with minimal cost, posing significant security risks to society. Consequently, Deepfake Detection (DFD), which aims at discerning between real and generated facial images, has become a heated research frontier, galvanizing extensive efforts.

However, current detectors mostly follow a standard evaluation protocol, which involves training on one dataset (Rossler et al., 2019) and testing on others (Dolhansky et al., 2019; Li et al., 2020b; Dolhansky et al., 2020; Zi et al., 2020; Zhou et al., 2021). Despite its popularity, this protocol fails to align with practical industrial scenarios, where abundant training samples are available yet significant out-of-distribution (OOD) generalization challenges (e.g., brand-new forgery types and meticulously synthesized facial images) emerge during testing. This discrepancy severely hinders the practical deployment of current detectors. To mitigate the gap, we construct the HydraFake dataset. As shown in Figure 2, we systematically collect and reproduce advanced deepfake methods, covering diversified deepfake techniques and in-the-wild forgeries from social media. To simulate potential challenges in real-world scenarios, we establish a rigorous and holistic evaluation protocol: the training set consists of abundant samples but is restricted to three basic forgery types, while the evaluation involves hierarchical OOD testing that spans in-domain, cross-model, cross-forgery and cross-domain scenarios, enabling a fine-grained understanding of the model’s capacities. As presented in Figure 2 (d), under such rigorous evaluation, current SOTA detectors generalize well to cross-model deepfakes but show limited abilities in cross-forgery and cross-domain scenarios.

To improve robustness on unseen forgeries and data domains, we seek to bring the generalization abilities of multi-modal large language models (MLLMs) to deepfake detection. Recent efforts (Huang et al., 2024; Guo et al., 2025b; Peng et al., 2025) have made initial attempts, but they focus on explainability, and the classification still relies on expert vision models. In contrast, we explore seamlessly integrating MLLMs into deepfake detection through their intrinsic reasoning abilities. However, directly applying deep reasoning faces a critical challenge: current MLLMs fall far short at deepfake detection (Ren et al., 2025; Tariq et al., 2025), so effective reasoning data is necessary to ground the abilities of the base model. To achieve this goal, we must answer two key questions: (1) what kind of reasoning process is helpful for the DFD task? and (2) with sufficient data, how can we ensure the model is learning to reason for DFD rather than memorizing?

For the first question, we introduce a pattern-aware reasoning framework. Recent studies (Zhao et al., 2025; Muennighoff et al., 2025) demonstrate that critical reasoning patterns greatly improve the OOD performance of LLMs. Drawing on this, we consider how humans approach deepfake detection: when determining the authenticity of an image, we tend to make a quick judgment based on our first impression (fast judgement), then identify one or two prominent features (reasoning) to draw a conclusion (conclusion). For more challenging samples, we may conduct a layered analysis (planning) and engage in more in-depth thinking to support or overturn our initial judgement (self-reflection). Based on this analogy, we extract these five thinking patterns to facilitate logical and holistic reasoning. Table 2 empirically shows the benefits of such pattern-aware reasoning over vanilla Chain-of-Thought (CoT).

For the second question, we introduce a two-stage training pipeline consisting of pattern-guided cold-start and pattern-aware exploration, yielding our Veritas model (Veritas means “Truth” in Latin). During cold-start, we employ SFT to internalize the thinking patterns. We further introduce a Mixed Preference Optimization (MiPO) strategy that leverages mixed non-preference data and human-annotated preference data to steer the model toward faithful and fine-grained reasoning. As shown in Figure 1, MiPO greatly improves reasoning quality and mitigates memorization. To further facilitate adaptive planning and self-reflection, we propose Pattern-aware Group Relative Policy Optimization (P-GRPO), which shapes reasoning behavior through online sampling and a pattern-aware reward mechanism. As a result, Veritas generalizes well to unseen forgeries and data domains while providing a transparent and precise decision process (Figure 1).

To sum up, our main contributions are:

• Dataset: We introduce HydraFake, a dataset that simulates real-world challenges with hierarchical generalization testing, advancing the evaluation protocol in deepfake detection and helping developers better locate the deficiencies of their detectors.

• Method: We propose a two-stage training pipeline that grounds the capabilities of MLLMs in deepfake detection through pattern-aware reasoning. Our model supports adaptive planning and self-reflection, delivering transparent and human-aligned decisions end-to-end.

• Performance: Our **Veritas** model achieves significant improvements over state-of-the-art detectors on cross-forgery and cross-domain scenarios, and our cold-start model serves as a strong reasoning foundation for further customization.

2 Related work

2.1 Deepfake Detection and Datasets

Deepfake detection aims to distinguish generated facial images from authentic human faces. Previous efforts have explored spatial-level (Ojha et al., 2023; Yan et al., 2024b; Tan et al., 2024b; Nguyen et al., 2024; Fu et al., 2025; Yan et al., 2024a; c; Yang et al., 2025c), frequency-level (Qian et al., 2020; Tan et al., 2024a; Zhou et al., 2024; Kashiani et al., 2025) and sequence-level (Gu et al., 2021; 2022b; 2022a; Yan et al., 2025) approaches, achieving remarkable progress on traditional benchmarks. To train a generalizable detector, some methods attempt to find “bias-free” fake images through spatial-domain blending (Li et al., 2020a; Shiohara & Yamasaki, 2022; Zhao et al., 2021), frequency-domain blending (Zhou et al., 2024; Kashiani et al., 2025) or feature-level augmentation (Yan et al., 2024b). However, the commonly adopted protocol, i.e., training on FF++ (Rossler et al., 2019) and testing on others (Dolhansky et al., 2019; Li et al., 2020b; Dolhansky et al., 2020; Zi et al., 2020; Zhou et al., 2021), suffers from two problems: (1) the training sources are overly narrow, and (2) the testing data exhibit limited forgery types and low resolution. Although many timely datasets (Yan et al., 2024a; Zhang et al., 2024b; Li et al., 2025; Huang et al., 2025b; Wang et al., 2025a; Wen et al., 2025a; Xia et al., 2025) have been proposed for AIGC detection, dataset development for deepfake detection has lagged behind. As a result, previous methods are biased towards such settings, exhibiting degraded generalization when learning from varying sources or mixed artifacts. To mitigate this problem, we introduce a hierarchical protocol in our HydraFake dataset, aiming to comprehensively reflect the generalization capability of the detectors.

2.2 MLLMs for Deepfake Detection

With the proliferation of MLLMs (Liu et al., 2023; Bai et al., 2025; Zhu et al., 2025), recent focus has shifted to explainable deepfake and AIGC detection. However, most methods still rely on small vision models for the final decision. For instance, M2F2-Det (Guo et al., 2025b) determines authenticity purely based on CLIP models, where the LLM is leveraged as a plug-in interpreter. Similarly, DD-VQA (Zhang et al., 2024a), FFAA (Huang et al., 2024) and VLF-FFD (Peng et al., 2025) develop post-processing systems to aggregate embeddings from small vision models. Some methods (He et al., 2025; Sun et al., 2025; Chen et al., 2025b) attempt to directly adopt the outputs from LLMs, e.g., Sun et al. (2025) construct precise forgery explanations to release the power of MLLMs. Recent methods (Huang et al., 2025a; Xu et al., 2024b; Zhou et al., 2025) also adopt MLLMs and curated datasets for AIGC detection. However, these methods generate post-hoc explanations by first determining the answer, so the potential of reasoning abilities for deepfake detection is still underexplored. The most recent methods (Gao et al., 2025; Xia et al., 2025) explore reasoning for AIGC detection, but they neglect adaptive reasoning patterns and are not tailored for facial forgery. Different from previous methods, we introduce human-like reasoning into deepfake detection, achieving promising improvements and delivering transparent decisions end-to-end.

3 HydraFake Dataset

In this part, we introduce our HydraFake dataset, including the construction process and evaluation protocol. Detailed statistics and information are provided in Appendix A.1.

3.1 Data Collection

Real Images. As shown in Figure 2 (a), the real images are collected from $8$ public datasets, containing both low-resolution images (i.e., LFW (Huang et al., 2008), CelebA (Liu et al., 2015), FaceForensics++ (FF++) (Rossler et al., 2019), FFIW (Zhou et al., 2021)) and high-resolution images (i.e., FFHQ (Karras et al., 2019), VFHQ (Xie et al., 2022), UADFV (Yang et al., 2019) and CelebAHQ (Karras et al., 2017)). The collected images are rigorously partitioned for training and testing.

Fake Images. The fake images come from three sources:

• Classic deepfake data sampled from FF++ (Rossler et al., 2019), DF40 (Yan et al., 2024d) and FFIW (Zhou et al., 2021), which mainly contain face swapping (FS) and face reenactment (FR) forgeries from $10$ generative models. The artifacts are mostly localized.

• Public deepfake data sampled from WILD (Bongini et al., 2025), the seeprettyface website and TalkingHeadBench (Xiong et al., 2025). This contains carefully synthesized faces from $16$ popular generators. However, there still exist corner cases such as fresh forgery types.

• Advanced deepfake data, for which we further reimplemented and crawled $10$K deepfake samples from $10$ advanced generators. Besides traditional deepfake techniques, the HydraFake dataset contains Face Restoration (Zhou et al., 2022), Face Relighting (Zhang et al., 2025b), Face Personalization (Jiang et al., 2025; Guo et al., 2024), Generative Face Swapping (Han et al., 2024) and deepfakes from Visual AutoRegressive models (VAR) (Han et al., 2025; Tang et al., 2024). To simulate real-world challenges, we also crawled $1$K deepfake images from social media, which include practical deepfakes generated from commercial apps, including GPT-4o (Hurst et al., 2024), Dreamina (team, 2025a) and Hailuo AI (team, 2025b).

Quality Control. For classic deepfake datasets, we only select FF++ (Rossler et al., 2019) and FFIW (Zhou et al., 2021), while not involving DFDC (Dolhansky et al., 2020), DFDCP (Dolhansky et al., 2019) and WDF (Zi et al., 2020) due to their low quality (e.g., unexpected blurring in real images). For our self-constructed deepfake data, we conduct strict quality control, e.g., for face personalization, we use Qwen2.5-VL-72B to tailor sample-specific prompts rather than using template-like prompts as in (Bongini et al., 2025). For face relighting, we generate multiple lighting sources for each identity and manually select high-quality samples. After filtering and balancing, our HydraFake dataset contains $50$ K real images and $50$ K fake images.

3.2 Evaluation Protocol

Training. As shown in Figure 2 (b), the training set contains $48$K images. Real images are from $5$ subsets, with the other $3$ subsets left out for testing. Fake images involve $21$ subsets but contain only $3$ forgery types (i.e., FS, FR and entire face generation (EFG)). This simulates a practical setting in which abundant training images are available but various forgery types and generative models remain unseen.

Evaluation. The evaluation is divided into four distinct levels:

• In-Domain (14K): testing images share the training data source but with different identities.

• Cross-Model (11K): fake images are generated by unseen models under controlled conditions such as template-based textual prompts. This includes SOTA models from recent years (e.g., FLUX1.1-Pro (Black Forest Labs, 2024), Adobe FireFly (Adobe, 2023), Starry AI (AI, 2023)) and distinct model architectures (e.g., VAR (Han et al., 2025; Tang et al., 2024) and a Video AR model (Sand-AI, 2025)). The real images are from the in-domain set but with different identities.

• Cross-Forgery (12K): fake images are generated by unseen manipulation techniques, involving attribute editing, generative face swapping, IP-preserved personalization, face relighting and face restoration. The real images are from the in-domain set but with different identities. This split evaluates the model’s capacity to detect fake images generated by unseen manipulations.

• Cross-Domain (15K): fake images are either generated under controlled conditions or collected from the web, including both unseen forgeries and unseen models. The real images are from unseen datasets (i.e., VFHQ (Xie et al., 2022), UADFV (Yang et al., 2019) and FFIW (Zhou et al., 2021)). The images are of different qualities, posing strong challenges.
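The four levels can be read as a grouping of test samples by split, with accuracy reported separately for each level. The following is a minimal sketch of that bookkeeping, assuming each test record carries a split tag, a binary label and a binary prediction; the split identifiers and the record layout are illustrative and not taken from the dataset release.

```python
from collections import defaultdict

# Illustrative split identifiers for the four evaluation levels.
SPLITS = ("in_domain", "cross_model", "cross_forgery", "cross_domain")

def accuracy_by_split(records):
    """Compute accuracy per split.

    records: iterable of (split, label, prediction), with 1 = fake, 0 = real.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for split, label, pred in records:
        if split not in SPLITS:
            raise ValueError(f"unknown split: {split}")
        total[split] += 1
        correct[split] += int(label == pred)
    # Report only splits that actually contain samples.
    return {s: correct[s] / total[s] for s in SPLITS if total[s] > 0}

demo = [
    ("in_domain", 1, 1),
    ("in_domain", 0, 0),
    ("cross_forgery", 1, 0),
    ("cross_forgery", 0, 0),
]
per_split = accuracy_by_split(demo)
```

Reporting each level separately, rather than a single pooled score, is what lets the protocol show where a detector's generalization breaks down.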

4 Method

In this section, we detail the two-stage training pipeline of Veritas, including pattern-guided cold-start and pattern-aware reinforcement learning, as shown in Figure 3.

4.1 Pattern-guided Cold-Start

To internalize thinking patterns for deepfake detection, we first employ a pattern-guided cold-start. Different from common practice, we involve two steps: Supervised Fine-Tuning (SFT) for format injection, and a Mixed Preference Optimization (MiPO) strategy to align the reasoning process.

SFT Pattern Injection. Suppose the SFT dataset is denoted as $\mathcal{D}_{1}=\{(\bm{q},\bm{s})_{i}\}_{i=1}^{N_{1}}$ , where $\bm{s}$ is the target output sequence including pattern-aware reasoning and final answer. $\bm{q}$ denotes input image and user query. The training objective maximizes the likelihood of generating $\bm{s}$ given input $\bm{q}$ :

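$$\mathcal{L}_{\text{SFT}}(\theta)=-\mathbb{E}_{(\bm{q},\bm{s})\sim\mathcal{D}_{1}}\left[\sum_{t=1}^{|\bm{s}|}\log\pi_{\theta}\left(s_{t}\mid\bm{q},\bm{s}_{<t}\right)\right]$$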

where $\pi_{\theta}$ denotes the token distribution from the current model. In the following we introduce the construction process of our training data $\mathcal{D}_{1}$ .

To minimize human costs, we use MLLMs for automated annotation, similar to recent practices (Huang et al., 2024; Xu et al., 2024b). However, this encounters two challenges in our case: (1) The MLLMs tend to overlook some subtle artifacts such as abnormal optical focusing. (2) The model prioritizes producing logically coherent paths over accurately locating artifacts. To mitigate the two issues, we construct a multi-step annotation pipeline. We first manually inspect a subset and summarize a comprehensive artifact taxonomy (Figure 9 (a)): (1) Perceptible structural anomalies, which are immediately visible and easy to detect. (2) Subtle low-level artifacts, which require careful inspection. (3) Cognitive violations of physical laws, which are implicit and require connecting to common sense or real-world knowledge. Then, we decouple the annotation into three specialized yet coherent steps (Figure 9 (b)), resulting in $36$K samples for $\mathcal{D}_{1}$. The detailed process and all prompt templates are provided in Appendix A.4. Annotated examples are presented in Figure 9 (c).

MiPO Reasoning Alignment. To further facilitate human-aligned reasoning, we meticulously curate a mixed preference dataset $\mathcal{D}_{2}=\{(\bm{q},\bm{s}_{w},\bm{s}_{l}^{\phi})_{i}\}_{i=1}^{N_{2}}\cup\{(\bm{q},\bm{s}_{w},\bm{s}_{l}^{\psi})_{i}\}_{i=1}^{N^{\prime}_{2}}$. Specifically, we collect two types of non-preference data for the fake images: (1) trajectories where the answer is correct but the reasoning content is not precise or detailed enough (i.e., $\bm{s}_{l}^{\phi}$), and (2) trajectories where the answer is incorrect (i.e., $\bm{s}_{l}^{\psi}$). $\bm{s}_{w}$ denotes preferred reasoning traces, which are precisely annotated by our human experts. Both $\bm{s}_{l}^{\phi}$ and $\bm{s}_{l}^{\psi}$ are sampled from the outputs of the SFT model, yielding $3$K high-quality paired samples for dataset $\mathcal{D}_{2}$. Note that the images in $\mathcal{D}_{2}$ strictly come from the in-domain training set, without introducing any OOD samples. Denoting the SFT model as $\pi_{\theta_{\text{SFT}}}$, the training objective for MiPO is formulated as:

[TABLE]

where $\sigma(\cdot)$ denotes the sigmoid function and $\beta$ controls the strength that the model deviates from the reference model. As shown in Figure 1, by learning from such mixed rejected traces, our model can perform more precise and fine-grained reasoning compared to pure SFT cold-start.
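The description of the objective (a sigmoid, a strength coefficient $\beta$, and the SFT model as reference) matches the standard DPO form applied to pairs with either kind of rejected trace. The following is a minimal sketch under that assumption and may differ from the paper's exact objective; the function names, the sequence-level log-probability inputs and the default `beta=0.1` are illustrative.

```python
import math

def preference_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style loss for one (preferred, rejected) pair.

    logp_* are sequence log-probabilities under the current policy;
    ref_logp_* are the same quantities under the frozen SFT reference.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), written in a stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def mixed_preference_loss(pairs, beta=0.1):
    """Average the pair loss over both kinds of rejected traces.

    pairs: list of (logp_w, logp_l, ref_logp_w, ref_logp_l) tuples, where the
    rejected trace is either imprecise-but-correct or incorrect.
    """
    losses = [preference_loss(*p, beta=beta) for p in pairs]
    return sum(losses) / len(losses)
```

Under this reading, the two kinds of rejected trace enter the same loss; what is "mixed" is the composition of the rejected set, not the functional form.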

4.2 Pattern-Aware Exploration

After cold-start, the trained model possesses the fundamental reasoning capacities for deepfake detection. However, it still fails on more challenging samples. To mitigate this, we introduce Pattern-Aware GRPO (P-GRPO) to encourage the model to perform comprehensive reasoning and, where needed, self-reflection. Unlike recent approaches (Tu et al., 2025; Xiao et al., 2025) that encourage adaptive reasoning through a length reward, we posit that absolute reasoning length is not critical. Instead, we incentivize appropriate thinking patterns through a pattern-aware reward mechanism.

Let the training data for P-GRPO be $\mathcal{D}_{3}=\{(\bm{q},\bm{a})_{i}\}_{i=1}^{N_{3}}$, where $\bm{a}$ denotes the binary answer. We randomly sample $9$K images from the in-domain training set. For a given query $\bm{q}$, P-GRPO samples $G$ responses $\{o_{1},o_{2},...,o_{G}\}$ using the current policy model $\pi_{\theta_{\text{old}}}$. The quality of the responses is scored as $\{R_{1},R_{2},...,R_{G}\}$ by the reward functions. With the cold-started model $\pi_{\theta_{\text{cold}}}$ adopted as the reference policy, the training objective is formulated as:

[TABLE]

where

[TABLE]
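In standard GRPO, the advantage of each response is its reward standardized within the group of $G$ samples drawn for the same query. The following is a minimal sketch, assuming P-GRPO keeps this normalization; the function name and the small `eps` guard against zero variance are implementation assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of G sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Example with G = 4 responses to a single query.
advantages = group_relative_advantages([2.0, 1.0, 0.0, -1.0])
```

When all responses in a group receive the same reward, every advantage is zero and the group contributes no policy gradient, which is why reward shaping that separates responses matters.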

The reward $R_{i}$ for each response is evaluated from three perspectives:

Pattern-aware Reward. Let $\mathcal{C}\in\{0,1\}$ represent the correctness of the final answer, with $\mathcal{C}=1$ denoting that the answer is correct. $\mathcal{P}\in\{0,1\}$ and $\mathcal{R}\in\{0,1\}$ represent whether the reasoning involves “planning” and “self-reflection”, respectively. The pattern-aware reward is defined as:

[TABLE]

Specifically, we encourage the model to reach correct answers through planning and self-reflection by assigning a larger reward (i.e., $2.0$ ) if they are involved in the reasoning process. However, if these patterns lead to incorrect answers, we impose a penalty for its overthinking. Since self-reflection is a more decisive pattern, we assign a larger penalty (i.e., $-1.0$ ) for errors resulting from it.
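The reward logic described above can be sketched as a small function. Only two values are stated in the text: $2.0$ for a correct answer reached with planning or self-reflection, and $-1.0$ for an error following self-reflection. The remaining values (`base_correct`, `planning_penalty`, `base_wrong`) are placeholders chosen for illustration and are assumptions, not the paper's settings.

```python
def pattern_aware_reward(correct, planning, reflection,
                         base_correct=1.0,       # assumption: not stated in the text
                         planning_penalty=-0.5,  # assumption: text only implies it is milder than -1.0
                         base_wrong=0.0):        # assumption: not stated in the text
    """Reward from correctness C, planning P and self-reflection R flags."""
    if correct:
        # Stated in the text: larger reward when planning or self-reflection is involved.
        return 2.0 if (planning or reflection) else base_correct
    if reflection:
        # Stated in the text: larger penalty for errors resulting from self-reflection.
        return -1.0
    if planning:
        return planning_penalty
    return base_wrong
```

The asymmetry is the point of the design: the same pattern that earns a bonus when it leads to a correct answer is penalized when it leads to a wrong one, which discourages reflexive overthinking.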

Reflection Quality and Format Reward. To facilitate meaningful self-reflection, we assess the quality of reflection by an external model $\mathcal{M}$ : $R_{\text{ref}}=\mathcal{M}(\bm{S})$ . The criterion is the originality of the reflection, i.e., whether it introduces new perspectives rather than restating prior discoveries. The model only obtains $R_{\text{ref}}$ when the answer is correct. For format reward $R_{\text{fmt}}$ , we predefine some combinations of reasoning patterns and set $R_{\text{fmt}}=1$ when the response conforms to valid formats.

Suppose $\mathbb{I}(\cdot)$ is the indicator function. The final reward $R$ for each response is defined as:

[TABLE]

In practice, given that only verifiable answers are required, the training data $\mathcal{D}_{3}$ can be freely expanded. Our cold-start model serves as a solid reasoning foundation, upon which the community can utilize custom data with P-GRPO to obtain more powerful reasoning models for deepfake detection.
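As a rough illustration of how the three reward terms interact, the sketch below sums them and gates the reflection-quality term on correctness, as stated above. The unweighted sum is an assumption; the actual weighting of the terms may differ.

```python
def total_reward(r_pattern, r_reflection, r_format, correct):
    """Combine the three reward terms for one response.

    The reflection-quality term counts only when the final answer is correct
    (stated in the text); the unweighted sum is an assumption.
    """
    gated_reflection = r_reflection if correct else 0.0
    return r_pattern + gated_reflection + r_format

# A correct, well-formatted response with a useful reflection.
example = total_reward(r_pattern=2.0, r_reflection=0.5, r_format=1.0, correct=True)
```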

5 Experiments

5.1 Experimental Setup

State-of-the-Art Methods. We trained $10$ state-of-the-art (SOTA) detectors on our dataset, including F3Net (Qian et al., 2020), UniFD (Ojha et al., 2023), IID (Huang et al., 2023), FreqNet (Tan et al., 2024a), ProDet (Cheng et al., 2024), NPR (Tan et al., 2024b), AIDE (Yan et al., 2024a), Co-SPY (Cheng et al., 2025), D3 (Yang et al., 2025b), Effort (Yan et al., 2024c). We also assess $4$ open-source MLLMs of similar size to our model, including Qwen2.5-VL-7B (Bai et al., 2025), InternVL3-8B (Zhu et al., 2025), MiMo-VL-7B (Team, 2025) and GLM-4.1V-9B-Thinking (Hong et al., 2025), along with $2$ powerful closed-source models GPT-4o (Hurst et al., 2024) and Gemini-2.5-Pro (Comanici et al., 2025). Besides, we evaluate recent MLLM-based forgery detectors, including FakeShield (Xu et al., 2024b), M2F2-Det (Guo et al., 2025b), SIDA (Huang et al., 2025a), FakeVLM (Wen et al., 2025c), FFAA (Huang et al., 2024). More details are in Appendix A.2.

Metrics. Following previous works (Zhang et al., 2024a; Guo et al., 2025b), we take Accuracy (Acc) to measure the model performance. Precision and Recall are reported in Appendix A.5.
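For reference, the three metrics can be computed from binary labels and predictions as in the following sketch. It assumes that "fake" is the positive class (label 1), which is a convention chosen here and not stated in the text.

```python
def binary_metrics(labels, preds):
    """Accuracy, precision and recall with fake = 1 as the positive class."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

metrics = binary_metrics(labels=[1, 1, 1, 0], preds=[1, 1, 0, 0])
```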

Implementation Details. We implement **Veritas** with InternVL3-8B (Zhu et al., 2025). For the cold-start SFT, we train the model for $3$ epochs using LoRA (Hu et al., 2022) (rank=128, $\alpha$=256). The learning rate is set to $5\times 10^{-5}$, with a batch size of $64$. For cold-start MiPO, the model is trained for $2$ epochs with the same settings as SFT. For P-GRPO, we further train the model for $2$ epochs with the same LoRA setting. The learning rate is set to $1\times 10^{-6}$ with a batch size of $16$. $G$ is set to $4$, with a temperature of $1.0$. $\beta$ and $\beta^{\prime}$ are set to [math]. We take UnifiedReward-Qwen-3B (Wang et al., 2025d) as the reward model $\mathcal{M}$. Each stage is initialized from the model produced by the previous stage.
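The per-stage settings above can be collected in one place as in the following sketch. The `StageConfig` class and field names are hypothetical and not tied to any training framework; the values are those reported in the text, and $\beta$ and $\beta^{\prime}$ are omitted because their values are not recoverable here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one training stage (illustrative container)."""
    epochs: int
    learning_rate: float
    batch_size: int
    lora_rank: int = 128   # shared LoRA setting across stages
    lora_alpha: int = 256

SFT = StageConfig(epochs=3, learning_rate=5e-5, batch_size=64)
MIPO = StageConfig(epochs=2, learning_rate=5e-5, batch_size=64)   # same settings as SFT
P_GRPO = StageConfig(epochs=2, learning_rate=1e-6, batch_size=16)

# P-GRPO sampling settings reported in the text.
GROUP_SIZE = 4
TEMPERATURE = 1.0
```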

5.2 Main Results

Comparison to SOTA detectors. As shown in Table 1, our **Veritas** model achieves SOTA performance across the four evaluation scenarios, with an average gain of 6.0% over the previous best. Existing detectors perform well on the cross-model split (over 90% for D3) but fall short on cross-forgery and cross-domain scenarios (mostly less than 85%). **Veritas** narrows the gap, achieving over 90.0% accuracy on unseen forgeries such as face restoration and personalization, over 90.0% on in-the-wild data from Dreamina, and 89.2% on GPT-4o. The cold-start model also achieves promising results, but without incentivizing planning and self-reflection, the cross-forgery results are degraded. More results and analyses can be found in Appendix A.5.

Comparison to SOTA MLLMs. Compared to our base model, **Veritas** achieves an average gain of 32.4%, suggesting the effectiveness of our training strategy. Models of similar size show limited ability for deepfake detection, with less than 60% accuracy. Gemini-2.5-Pro performs best among these MLLMs, even outperforming some of the fine-tuned detectors. **Veritas** surpasses Gemini-2.5-Pro by 11.8%, demonstrating strong generalization.

Comparisons to MLLM-based detectors. For fair comparisons, we restrict our training scope (Table 1). Even with limited data scope, **Veritas-mini** still outperforms existing MLLM-based detectors, indicating the effectiveness of the proposed framework. M2F2-Det and FFAA, though targeted at deepfake detection, suffer from poor generalization on HydraFake. SIDA-7B and FakeVLM achieve promising results by contrast. Moreover, **Veritas** shows advantages in both detection accuracy and reasoning depth (Figure 6). More cases can be found in Appendix A.8.

5.3 Ablation Studies

We provide primary ablations in the main text. Further analyses of the training protocol (A.5.2), results on a recent benchmark (Li et al., 2025) (A.5.2), the selection of P-GRPO training data (A.5.4) and reward model (A.5.6), hyperparameters (A.5.5) and efficiency (A.5.3) can be found in the Appendix.

Effect of pattern-aware reasoning. As shown in Table 2, we compare different reasoning paradigms using SFT and P-GRPO training. Although the improvements on in-domain datasets are marginal, our pattern-aware reasoning demonstrates clear advantages over flexible reasoning on OOD scenarios, achieving 6.2% and 3.3% gains on CF and CD testing respectively. The post-hoc explanation adopted in recent methods exhibits degraded performance in OOD testing, further verifying the superiority of pattern-aware reasoning.

Ablations on different training stages. As shown in Figure 5, we investigate the effect of each training stage. Applying either MiPO or P-GRPO on top of the SFT model achieves significant gains, with P-GRPO performing better owing to its online sampling and pattern-aware incentivization. Applying MiPO before P-GRPO yields the best performance, achieving 2.9% and 2.1% gains on CF and CD testing respectively. This is because MiPO ensures high-quality rollouts in the subsequent stage, facilitating more accurate policy updates for online RL.

Effect of Pattern-guided Cold-Start. As shown in Figure 5, we investigate different RL settings without cold-start. The training data is kept consistent with our two-stage pipeline. The answer-only model achieves better ID results, while incorporating thinking improves CM and CD performance. However, all settings underperform the model with cold-start, because low-quality explorations lead to unstable training. Results in Figure 5 further verify the effectiveness of MiPO during cold-start.

Effect of Pattern-aware GRPO. As shown in Table 3, our P-GRPO achieves noticeable improvements compared to the original GRPO. Specifically, the pattern-aware reward outperforms the vanilla accuracy reward, especially on CF and CD scenarios. The reflection quality reward benefits both the original GRPO and our P-GRPO, which demonstrates the importance of high-quality self-reflection. In Appendix A.5.4, we observe that adding a small amount of “unseen” data in P-GRPO further improves OOD performance, demonstrating promising scalability with only binary labels required.

Effect of specific reasoning patterns. As shown in Table 5, “fast judgement” is helpful for CF and CD, but is not critical overall. “planning” is more effective on CM, since the fully synthesized images require a more holistic and structured analysis. “self-reflection” is critical especially on CF and CD, as it incentivizes the model to discover those unseen artifacts. “conclusion” provides certain gains, suggesting that synthesizing separate evidence into a coherent verdict is also important.

Ablations on the non-preference in MiPO. As shown in Table 6, $\bm{s}_{l}^{\phi}$ helps improve the performance on CF (+1.3%) and CD (+0.8%) scenarios. To understand the effects, we provide a qualitative case in Figure 20. Without $\bm{s}_{l}^{\phi}$ , the model still gets correct answers, but the analysis is superficial and less detailed, which causes certain failures on unseen forgeries that might require in-depth reasoning.

5.4 Further Analyses

Evaluation of reasoning quality. To evaluate the reasoning quality, we take two types of assessments: (1) score evaluation which is based on predefined criteria (Figure 33). (2) Pairwise comparison which directly compares outputs from two models. We adopt MLLM-as-a-Judge (Chen et al., 2024a), using GPT-4o and Gemini-2.5-Pro for evaluation. Similar to (Zhou et al., 2025), we randomly select $1$ K samples for evaluation. As shown in Table 7, our model achieves the best score and ELO rating, where MiPO greatly improves the reasoning quality. Moreover, our MiPO outperforms DPO in raising reasoning quality, which verifies the effectiveness of mixed non-preference strategy.
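For readers unfamiliar with how pairwise judgments become a rating, the following sketch shows the standard Elo update applied to one comparison between two models. The K-factor of 32 and the starting rating of 1000 are conventional defaults assumed here; the settings used for the reported ELO ratings are not specified in this section.

```python
def expected_score(r_a, r_b):
    """Probability that model A is preferred over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32.0):
    """Update both ratings after one comparison.

    score_a is 1.0 if the judge prefers A, 0.0 if it prefers B, 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two models start level; the judge prefers model A on this sample.
rating_a, rating_b = elo_update(1000.0, 1000.0, score_a=1.0)
```

Repeating the update over all judged pairs produces a ranking that reflects relative preference rather than absolute scores, which complements the criteria-based score evaluation.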

Different fine-tuned base models and model sizes. As shown in Table 4, we adopt different MLLMs as our base model. InternVL3-8B outperforms Qwen2.5-VL-7B and MiMo-VL-7B, due to the dynamic high resolution strategy. InternVL3-2B achieves promising performance with fewer parameters, while scaling up to 14B yields considerable gains on CM and CF scenarios.

Robustness evaluation. We investigate the performance under JPEG compression and Gaussian blur. Results in Table 8 highlight the robustness of our model. Our model achieves consistently high performance under JPEG compression and maintains state-of-the-art results across different perturbations. Notably, this robustness is achieved without training on corresponding data augmentations such as random Gaussian blur, which instead are commonly adopted in previous methods.

6 Conclusion

In this paper, we introduce the HydraFake dataset and the **Veritas** model. HydraFake introduces a holistic evaluation protocol to comprehensively measure generalization capacities. We then build a multi-modal large language model (MLLM) based deepfake detector trained with our two-stage pipeline. Results on HydraFake show that current detectors struggle on cross-forgery and cross-domain scenarios, while our model greatly narrows the gap and delivers a transparent decision process. We hope this work can inspire more generalizable and reliable deepfake detection.

Acknowledgments

This work was supported by the Beijing Natural Science Foundation JQ23016, the Chinese National Natural Science Foundation Projects 62476273, 62406320, 62276254 and U23B2054, the Science and Technology Development Fund of Macau Project 0123/2022/A3, 0140/2024/AGJ, 0044/2024/AGJ and 0084/2024/RIB2, and Ant Group.

Appendix A Appendix

The appendix is organized as follows:

$\bullet$ §A.1 Details of HydraFake Dataset.

• §A.1.1 Training Set.

• §A.1.2 In-Domain Evaluation.

• §A.1.3 Cross-Model Evaluation.

• §A.1.4 Cross-Forgery Evaluation.

• §A.1.5 Cross-Domain Evaluation.

$\bullet$ §A.2 More Implementation Details.

$\bullet$ §A.3 More Discussions.

$\bullet$ §A.4 Multi-Step Annotation Pipeline.

$\bullet$ §A.5 More Experimental Results.

• §A.5.1 More Results and Analyses on HydraFake.

• §A.5.2 Cross Benchmark Comparison.

• §A.5.3 Efficiency Comparison.

• §A.5.4 Effect of Training Data in P-GRPO Stage.

• §A.5.5 Analysis of Hyperparameters.

• §A.5.6 Effect of Different Reward Model.

$\bullet$ §A.6 Full Prompt Templates.

$\bullet$ §A.7 More Qualitative Results.

$\bullet$ §A.8 More Qualitative Comparisons with Existing MLLM-based Detectors.

$\bullet$ §A.9 Failure Analysis of Veritas.

$\bullet$ §A.11 Ethics Statement.

$\bullet$ §A.12 Limitations and Future Work.

A.1 Details of HydraFake Dataset

In this section, we provide more details about our HydraFake Dataset. We introduce the dataset from the perspective of training and evaluation protocols.

A.1.1 Training

Real Images. HydraFake dataset contains real images from $8$ public datasets. We extract $5$ subsets as the training set, containing $3$ low-quality subsets and $2$ high-quality subsets. The low-quality images include FF++ (Rossler et al., 2019), CelebA (Liu et al., 2015) and LFW (Huang et al., 2008). The high-quality images include FFHQ (Karras et al., 2019) and CelebAHQ (Karras et al., 2017). This results in $24$ K real images for training.

Fake Images. In practical scenarios, abundant fake images are available for training, but these images have two characteristics: (1) the quality of the images varies greatly, and (2) the forgery types are often limited. To mimic such a setting, we extract $21$ subsets as the training set and strictly control the seen forgeries. We only include face swapping (FS), face reenactment (FR) and entire face generation (EFG) in our training set, leaving various forgery types unseen. Moreover, the deepfake methods in our training set are not the latest, leaving fresh methods for the evaluation.

• FS: FF++, BlendFace, FSGAN, SimSwap, FaceDancer, MobileSwap.

• FR: FF++, Facevid2vid, Hallo, Hallo2, LivePortrait, AniPortrait, EmoPortrait.

• EFG: Dall-e 1, StyleGAN, StyleGAN2, VQGAN, Midjourney, Seeprettyface, Stable Cascade, Stable Diffusion XL, Attend-and-Excite.

A.1.2 In-Domain Evaluation

For in-domain testing, we select $5$ subsets from the training set, and use the unseen identities as the testing samples. Specifically, we balance image quality and forgery types, choosing an FS dataset FF++ (low-quality), two FR datasets Facevid2vid (low-quality) and Hallo2 (high-quality), and two EFG datasets StyleGAN (Karras et al., 2019) (high-quality) and Midjourney (high-quality). For low-quality subsets, the real images are sampled from FF++. For high-quality subsets, the real images are sampled from FFHQ.

A.1.3 Cross-Model Evaluation

The cross-model testing images are deepfakes generated by unseen models, and the difficulty varies with the model architecture. This split includes $6$ subsets, i.e., Adobe Firefly (Adobe, 2023), MAGI-1 (Sand-AI, 2025), Flux1.1 Pro (Black Forest Labs, 2024), StarryAI (AI, 2023), Infinity (Han et al., 2025) and HART (Tang et al., 2024).

Infinity. Infinity is a Bitwise Visual AutoRegressive Model (VAR) capable of generating high-resolution and photorealistic images from textual prompts. Infinity redefines visual autoregressive model under a bitwise token prediction framework with an infinite-vocabulary tokenizer and bitwise self-correction. We reproduce the Infinity-8B model, which is capable of generating $1024\times 1024$ images. The control prompts are generated using Qwen3-32B model, which leverages a template-like sentence, balances between gender and age and avoids any semantic conflicts (e.g., wrinkles on the little girl’s face) with the help of LLM.

HART. Hybrid Autoregressive Transformer (HART) is an autoregressive (AR) visual generation model capable of directly generating $1024\times 1024$ images, rivaling diffusion models in image generation quality. It contains a hybrid image tokenizer to improve the fidelity of the generated images. We reproduce HART model, which is based on Qwen2-VL-1.5B. The textual prompts are consistent with Infinity.

Adobe Firefly. Adobe Firefly is a proprietary suite of multimodal generative AI models developed by Adobe. It is built upon a deeply customized and optimized diffusion model. It exclusively utilizes the Adobe Stock library, open-licensed content, and public domain works, a choice intended to ensure the commercial viability of its generated outputs and to mitigate copyright risks. We collect this subset from WILD (Bongini et al., 2025), which is generated using template-like textual prompts.

StarryAI. Starry AI is an advanced generative artificial intelligence model engineered for high-fidelity text-to-image synthesis. Its core architecture integrates a transformer-based encoder for the semantic interpretation of textual prompts with a latent diffusion model for the iterative synthesis of visual content. The model excels at translating complex, abstract descriptions into visually coherent and stylistically nuanced imagery. We collect this subset from WILD (Bongini et al., 2025), which is generated using template-like textual prompts.

MAGI-1. MAGI-1 is an autoregressive denoising video generation model (Video AR) that generates videos chunk-by-chunk instead of as a whole. It excels in generating high-quality, temporally consistent videos from text or image prompts. With support for large-scale model sizes and long context lengths, it is well-suited for a wide range of creative and generative video applications. We collect this subset from TalkingHeadBench (Xiong et al., 2025).

Flux1.1 Pro. Flux1.1 Pro is currently the most advanced model of Flux series, which is introduced by Black Forest Labs. It is designed for fast, high-resolution and realistic text-to-image generation. We collect this subset from WILD (Bongini et al., 2025), which is generated using template-like textual prompts.

A.1.4 Cross-Forgery Evaluation

The cross-forgery testing images come from deepfakes generated by unseen forgeries. With the rapid development of generative techniques, novel types of forgery are constantly emerging, such as portrait relighting (Zhang et al., 2025b) and IP-preserved personalization (Jiang et al., 2025; Guo et al., 2024). To assess the model’s generalization capacities when encountering these emerging deepfake methods, we collect $5$ representative forgery methods in our dataset, including face relighting (Zhang et al., 2025b), face restoration (Zhou et al., 2022), generative face swapping (Han et al., 2024), facial attribute editing (Choi et al., 2020) and face personalization (Jiang et al., 2025; Guo et al., 2024).

Face Relighting. The method is based on IC-Light (Zhang et al., 2025b), which represents an emerging capability of generative models that is becoming prevalent. “IC-Light” means “Imposing Consistent Light”; it is capable of adjusting the lighting sources and intensity in an image while keeping the subject largely unchanged. The condition is based on textual prompts. We sampled real images from FFHQ (Karras et al., 2019) and reproduced IC-Light to change the lighting condition of these real images. We implemented $10$ lighting types (e.g., “sunshine from window”, “soft studio lighting” and “neon light in city”) and $4$ lighting sources (i.e., “left”, “right”, “top” and “bottom”). We use multiple seeds for each condition, and then manually filter out those of low quality.
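The relighting conditions form a grid over lighting type, light-source position and random seed. The following sketch enumerates such a grid; it lists only the three lighting types named above (the full set of ten is not given here), and the seed values and dictionary keys are illustrative assumptions.

```python
from itertools import product

# Only the three lighting types named in the text; the remaining seven are not listed here.
LIGHTING_TYPES = ["sunshine from window", "soft studio lighting", "neon light in city"]
LIGHT_SOURCES = ["left", "right", "top", "bottom"]

def relighting_conditions(seeds):
    """Enumerate every (lighting type, light source, seed) combination."""
    return [
        {"prompt": lighting, "source": source, "seed": seed}
        for lighting, source, seed in product(LIGHTING_TYPES, LIGHT_SOURCES, seeds)
    ]

# Hypothetical seeds; the number of seeds per condition is not stated in the text.
conditions = relighting_conditions(seeds=[0, 1])
```

Each generated candidate would then go through the manual quality filtering described above before entering the dataset.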

Face Restoration. The method is based on CodeFormer (Zhou et al., 2022), which can recover high-quality faces from low-quality (e.g., blurred) natural faces, even when the inputs are severely degraded, while maintaining fidelity. This is a helpful technique with legitimate uses in many domains. However, since it can also be applied to low-quality deepfake images, we treat it as an unseen forgery in our dataset. Specifically, we sampled low-quality fake images from DF40 (Yan et al., 2024d) and TalkingHeadBench (Xiong et al., 2025), and then employed CodeFormer to restore them into $512\times 512$ images.

Facial Attribute Editing. Facial attribute editing is a common manipulation, involving altering facial attributes such as hairstyle and makeup. In our dataset we leave this type out for testing. We collect images generated by StarGANv2 (Choi et al., 2020) from DF40 (Yan et al., 2024d).

IP-preserved Face Personalization. IP-preserved face personalization technology enables the generation of synthetic faces that closely retain the distinctive visual attributes of original intellectual property (IP). By producing highly realistic and IP-consistent deepfakes, it can facilitate unauthorized exploitation or impersonation of protected characters and personalities. With the advancement of generative models, face personalization techniques are now capable of maintaining high fidelity while following complex contextual and subject-specific instructions (e.g., transforming an ID photo into an image of the singer in the bar). We reproduce two recent methods, PuLID (Guo et al., 2024) and InfiniteYou (Jiang et al., 2025). We sample real images from FFHQ as the source images. To enhance the realism and semantic coherence of face personalization, we employ Qwen2.5-VL-72B to generate customized prompts for each image.

Generative Face Swapping. The face swapping data in existing datasets are often produced by conventional approaches such as graphics-based methods or GAN models. Nowadays, however, generative methods based on diffusion models are capable of producing high-fidelity swapped faces. Since the latest methods such as DreamID (Ye et al., 2025a) and DynamicFace (Wang et al., 2025c) are not open-sourced yet, we implemented FaceAdapter (Han et al., 2024), which produces high-quality swapping data. We will keep tracking the advancements in these methods and update our dataset. The source faces are sampled from FFHQ. We manually filter out low-quality generated images, retaining only high-fidelity samples.

A.1.5 Cross-Domain Evaluation

The “domain” in our dataset mainly refers to the data source. For instance, the cross-forgery data are generated using in-domain real images from FFHQ, which alters the manipulation method but keeps the data source unchanged. For cross-domain testing, in contrast, the fake images are either generated from unseen real data sources or entirely generated by commercial models. We also crawled fake images from social media, which serve as a challenging cross-domain evaluation. Specifically, our cross-domain testing can be clustered into three types: (1) Classic datasets, including DeepFaceLab from DF40 (Yan et al., 2024d) and FFIW (Dolhansky et al., 2019) which is widely adopted in existing benchmarks. (2) Reproduced deepfakes, including face personalization generated using real images from VFHQ (Xie et al., 2022). (3) In-the-wild deepfakes, where we collect data from social media such as Xiaohongshu and TikTok. We retrieved images through the tags of the posts, collected the images generated by GPT-4o (Hurst et al., 2024) and Dreamina (team, 2025a), and cropped out the digital watermarks. We further generate deepfake videos using Hailuo AI (team, 2025b) and extract $8$ frames from each video.

A.2 More Implementation Details

Training resources. Our model is trained with $8$ PPUE GPUs based on ms-swift (Zhao et al., 2024). The theoretical peak computational capacity (TFLOPS) of one PPUE GPU is roughly half that of an NVIDIA A100 GPU, and each PPUE GPU has $96$ GB VRAM. With such infrastructure, the SFT and MiPO stages take $5.5$ hours and $2$ hours, respectively. The P-GRPO stage takes $11$ hours on $9$ K training samples. All inference is conducted on a single PPUE GPU.

Training details of previous methods. For F3Net (Qian et al., 2020), UniFD (Ojha et al., 2023), IID (Huang et al., 2023), FreqNet (Tan et al., 2024a), ProDet (Cheng et al., 2024), NPR (Tan et al., 2024b) and Effort (Yan et al., 2024c), we reproduce them based on DeepfakeBench (Yan et al., 2023). For AIDE (Yan et al., 2024a), Co-SPY (Cheng et al., 2025) and D3 (Yang et al., 2025b), we train the models with the official code and perform inference using DeepfakeBench. The images are first randomly cropped to $256\times 256$ and then resized to $224\times 224$ . For Co-SPY (Cheng et al., 2025), the images are resized to $384\times 384$ following the official implementation. We apply a series of data augmentations during training, including random flipping, rotation, Gaussian blur, brightness and contrast alteration, color jitter and JPEG compression. Following official guides, AIDE and D3 are trained for $100$ epochs. FreqNet and NPR are trained for $50$ epochs. The first stage of Co-SPY (i.e., artifact and semantic encoders) is trained for $20$ epochs, and the second stage (i.e., combination) is trained for another $10$ epochs. Effort is trained for $10$ epochs and other methods are trained for $20$ epochs. During testing, the images are resized to $224\times 224$ . For all these methods, we curate a validation set containing 4K in-domain images for model selection.

Other details. The training data for all stages are strictly sampled from the HydraFake training set. The $36$ K SFT data are randomly sampled and balanced across forgery types. The $3$ K MiPO pairs are selected based on the SFT model’s outputs. We select $800$ images for which the SFT model fails to produce correct answers in all $8$ rollouts. Each image is paired with $4$ manually selected non-preference chains (from the SFT model’s outputs) and $1$ manually annotated preference chain, resulting in 3K samples for MiPO. The $9$ K P-GRPO data are randomly sampled and balanced across forgery types. For open-sourced MLLMs, we provide prior knowledge and instruct the model to perform thinking in the prompts. The full prompts are provided in Figure 38, Figure 39 and Figure 40. For Gemini-2.5-Pro, we enable thinking and searching. $\lambda_{1}$ and $\lambda_{2}$ are set to $1.0$ and $0.25$ , respectively. The valid output formats for $R_{fmt}$ in P-GRPO are listed in Figure 8.
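The MiPO pair construction described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' pipeline: the `PreferencePair` container, the helper names and the toy data are hypothetical, and the manual selection and annotation steps are represented only by placeholder strings.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    image_id: str
    chosen: str    # manually annotated preferred reasoning chain
    rejected: str  # non-preferred chain taken from the SFT model's rollouts


def is_hard(rollout_correct):
    """An image is 'hard' if at least one of its rollouts is wrong."""
    return not all(rollout_correct)


def build_pairs(image_id, chosen_chain, rejected_chains):
    """Pair the single preferred chain with each non-preferred chain."""
    return [PreferencePair(image_id, chosen_chain, r) for r in rejected_chains]


# Toy data: correctness flags for 8 rollouts per image (hypothetical values).
rollouts = {
    "img_a": [True] * 8,
    "img_b": [True, False, True, True, False, True, True, True],
}
hard_ids = [i for i, flags in rollouts.items() if is_hard(flags)]

pairs = build_pairs("img_b", "preferred chain", ["bad 1", "bad 2", "bad 3", "bad 4"])
print(hard_ids, len(pairs))  # ['img_b'] 4
print(800 * 4)               # 3200 pairs, i.e. roughly 3K
```

With $800$ hard images and $4$ non-preferred chains each, this pairing gives $3{,}200$ pairs, which matches the stated 3K scale.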

A.3 More Discussions

Difference between explainable and reasoning deepfake detection. In this part, we formulate different task settings of MLLM-based deepfake detection. Given the input image $\bm{I}$ , deepfake detection aims to determine its authenticity $\mathcal{Y}\in{\{0,1\}}$ , where $1$ means the image is fake and $0$ means it is real. Suppose the input image and query are collectively denoted as $\bm{q}$ . The sequential outputs of the MLLM are denoted as $\bm{s}=\{\bm{s}_{1},\bm{s}_{2},...,\bm{s}_{T}\}$ , where $T$ is the sequence length. The conditional probability of sequence $\bm{s}$ is written as $P(\bm{s}|\bm{q})=\prod_{t=1}^{T}P(\bm{s}_{t}|\bm{q},\bm{s}_{<t})$ .

Recent works (Chen et al., 2024b; He et al., 2025) utilize MLLMs for explainable deepfake detection, where the MLLM first generates the answer $\bm{s}_{\mathcal{A}}$ (e.g., “fake” or “this image is fake”). A detailed explanation sequence $\bm{s}_{\mathcal{E}}$ is then generated based on $\{\bm{q},\bm{s}_{\mathcal{A}}\}$ (where $\mathcal{A}\ll\mathcal{E}$ ). For simplicity, we suppose the answer ${\bm{s}_{\mathcal{A}}}$ is a single word. The process can be decomposed into:

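$$P(\bm{s}|\bm{q})=\prod_{t=1}^{\mathcal{A}}P(\bm{s}_{t}|\bm{q})\cdot\prod_{t=\mathcal{A}+1}^{\mathcal{A}+\mathcal{E}}P(\bm{s}_{t}|\bm{q},\bm{s}_{<t})$$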

where the final decision (i.e., $\prod_{t=1}^{\mathcal{A}}P(\bm{s}_{t}|\bm{q})$ ) is solely conditioned on the input image. This is fundamentally similar to small vision models, where the distributional mapping $f:\bm{I}\rightarrow\mathcal{Y}$ is estimated directly within a single token, which makes it prone to overfitting.

In contrast, we formulate deepfake detection as a reasoning task. The MLLM first conducts holistic reasoning, denoted as $\bm{s}_{\mathcal{R}}$ , and the final answer is then determined in $\bm{s}_{\mathcal{A}}$ :

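$$P(\bm{s}|\bm{q})=\prod_{t=1}^{\mathcal{R}}P(\bm{s}_{t}|\bm{q},\bm{s}_{<t})\cdot\prod_{t=\mathcal{R}+1}^{\mathcal{R}+\mathcal{A}}P(\bm{s}_{t}|\bm{q},\bm{s}_{<\mathcal{R}+\mathcal{A}})$$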

where the final answer (i.e., $\prod_{t=\mathcal{R}+1}^{\mathcal{R}+\mathcal{A}}P(\bm{s}_{t}|\bm{q},\bm{s}_{<\mathcal{R}+\mathcal{A}})$ ) builds on both the inputs and the reasoning process. The mapping becomes $f^{\prime}:\bm{I}\rightarrow\mathcal{R}\rightarrow\mathcal{Y}$ , enabling more comprehensive and adaptive modeling.

Large Reasoning Models. Large Language Models (LLMs) are inherently good reasoners for general tasks. Simple prompt engineering can elicit the reasoning behaviors of LLMs (Wei et al., 2022; Kojima et al., 2022), which is termed Chain-of-Thought (CoT). Building on the impressive capabilities of CoT, the community began to explore large-scale and structured reasoning, leading to powerful reasoning models (Jaech et al., 2024; Guo et al., 2025a; Xu et al., 2024a) through tailored post-training. For instance, DeepSeek-R1 (Guo et al., 2025a) adopts a reasoning cold-start followed by Reinforcement Learning with Verifiable Rewards (RLVR) to incentivize general reasoning capabilities. Inspired by the success of RLVR, recent works (Liu et al., 2025; Zhang et al., 2025a) attempt to introduce pure rule-based RL into the multimodal domain. However, a recent study (Yue et al., 2025) points out that pure RLVR cannot introduce novel abilities to the base model. We also empirically reveal the suboptimal performance achieved by pure RL. Therefore, we first introduce a high-quality cold-start (Liao et al., 2025; Chen et al., 2025a; Team et al., 2025) to internalize the thinking patterns into the base model. Unlike general tasks that require diverse and flexible thinking patterns (Zhan et al., 2025), deepfake detection is a well-defined task. Therefore, we establish a unified reasoning framework to facilitate efficient thinking.

Rationale for reasoning in deepfake detection. A possible concern is that even humans can fail on some highly realistic deepfakes, so applying reasoning to deepfake detection would face the same problem. However, we point out that there is a physiological limit on human perception. Humans excel at high-level semantic reasoning, such as judging contextual plausibility, but are less equipped to detect low-level digital artifacts like slight blurriness or unnatural texture patterns. In contrast, a machine can be trained to perceive these subtle artifacts with superhuman accuracy. The primary challenge that traditional detectors face is not perception but generalization, i.e., they tend to overfit to specific artifact patterns. This is where pattern-aware reasoning becomes crucial. Our goal is not to mimic a human’s intuitive guess, but to emulate a forensic expert’s systematic investigation. Our approach uniquely combines the machine’s superhuman perception with a structured and human-like reasoning framework (e.g., planning, reasoning, self-reflection, conclusion). As illustrated in Figure 21, the model can perceive subtle artifacts (e.g., barely noticeable blurriness and faint texture anomalies) that are nearly imperceptible to humans. Therefore, reasoning is crucial not to replicate the fallible human eye, but to provide a logical structure that effectively leverages the model’s perceptual abilities for robust generalization.

A.4 Multi-Step Annotation Pipeline of SFT data

As shown in Figure 9, for fake images we divide the annotation process into three steps:

Step-I: Based on human inspection, we found that most fake images exhibit $2$ to $3$ types of artifacts, hence we aim to find the two most prominent artifacts. To reduce model bias, we employ an ensemble voting strategy, leveraging Qwen2.5-VL-72B (Bai et al., 2025), Kimi-VL-A3B-Thinking (Team et al., 2025) and InternVL3-78B (Zhu et al., 2025), each sampled 5 times (15 votes in total). Only the answers that receive more than 10 votes are selected, ensuring the reliability of the final results.

Step-II: In this stage, we aim to extract visual details that conform to the identified artifacts. We found that Qwen2.5-VL-72B performs better than GPT-4o (Hurst et al., 2024) here, as it generates detailed and factual responses. This yields concrete explanatory texts, as in recent practice (Wen et al., 2025c; Gao et al., 2025; Zhang et al., 2025c). However, such plain explanations lack human-like reasoning logic, which hinders generalization to OOD samples.

Step-III: To emulate human mindset, we further transform the above explanations into logical chains. We define five “thinking patterns” and instruct the model to rewrite the explanations into different tags strictly based on the original meaning. We observed that large reasoning models are inherently adept at generating highly logical content. Therefore, we use Qwen3-235B-A22B (Yang et al., 2025a) for this step, yielding high-quality reasoning data. Finally, the data undergo a filtering process, which involves rule-based filtering and balancing among different forgery types.
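The ensemble vote in Step-I can be sketched as below. This is a minimal sketch under assumptions: the text does not specify the voting unit, so here each sampled answer casts one vote per artifact label it mentions; the function name, model keys and artifact labels are hypothetical.

```python
from collections import Counter


def vote_artifacts(samples_per_model, min_votes=10, top_k=2):
    """Tally artifact labels over all sampled answers and keep the reliable ones.

    samples_per_model maps a model name to a list of sampled answers,
    where each answer is a list of artifact labels.
    """
    votes = Counter()
    for answers in samples_per_model.values():
        for answer in answers:
            # Count each artifact at most once per sampled answer.
            votes.update(set(answer))
    # Keep labels with strictly more than `min_votes` votes.
    reliable = [(label, n) for label, n in votes.items() if n > min_votes]
    # Sort by vote count (descending), then by label for determinism.
    reliable.sort(key=lambda item: (-item[1], item[0]))
    return [label for label, _ in reliable[:top_k]]


# Toy example: 3 models x 5 samples = 15 votes available per artifact.
samples = {
    "model_a": [["blurry edges", "odd texture"]] * 5,
    "model_b": [["blurry edges"]] * 4 + [["odd lighting"]],
    "model_c": [["blurry edges", "odd texture"]] * 3 + [["odd lighting"]] * 2,
}
print(vote_artifacts(samples))  # ['blurry edges']
```

In this toy case only "blurry edges" clears the threshold (12 of 15 votes); "odd texture" receives 8 votes and is discarded.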

For real images, we only use the last two steps (without the need for anomaly detection), relabeled here as Step-I and Step-II. For Step-I we provide the full artifact list to Qwen2.5-VL-72B for comprehensive forensics of visual facts. For Step-II we adopt Qwen3-235B-A22B to convert the explanation texts into a pattern-aware reasoning chain. Specifically, for some low-quality real images, which may contain misleading artifacts like unexpected blurriness or missing visual details, we instruct the model to point this out and put it in the “self-reflection” content. However, the model is not capable of directly perceiving such minor artifacts, especially when told the image is authentic. Hence, we provide rough dataset-level difficulty information and encourage the model to perform self-reflection on those difficult images. This mitigates the problem, although it cannot be fully addressed due to the significant loss of details in those low-resolution images.

A.5 More Experimental Results

A.5.1 More Results and Analyses on HydraFake

We provide full experimental results for all subsets. To help better understand the model performance, we report Precision and Recall for each subset, with fake being the positive label.
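For reference, Precision and Recall with fake as the positive label can be computed as in the following sketch. The function name and toy labels are illustrative and not taken from the paper's evaluation code.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall with `positive` (fake = 1) as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels: 1 = fake, 0 = real.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
```

Under this convention, low Precision means real images are being flagged as fake, while low Recall means fake images are being missed.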

The in-domain results are shown in Table 10. Effort achieves the best results among previous detectors. It is worth noting that, under the mixed training sources, previous methods struggle even on in-domain datasets, i.e., most methods achieve less than 90% average performance. This is mainly due to the degraded performance on low-resolution datasets such as FF++ and Facevid2vid. We suppose this is a major deficiency in current deepfake detectors, which tend to be biased toward image resolution. Our model performs better, reaching over 99.5% on the high-resolution images, while the results on low-resolution subsets still have room for improvement.

The cross-model results are shown in Table 11. D3 achieves the best results among previous detectors. Many previous methods achieve good performance on cross-model scenarios, with averaged performance greater than 90%, such as D3, ProDet, Co-SPY and Effort. These models perform especially well on VAR architectures, achieving over 95% accuracy. Our model demonstrates strong generalization on cross-model data, achieving almost 99% accuracy, while the recall on proprietary models (e.g., Adobe Firefly and StarryAI) still has room for improvement.

The cross-forgery results are shown in Table 12. Effort achieves the best performance among previous detectors. The performance of most detectors is limited. For instance, the detectors tailored for deepfake facial images (i.e., IID and ProDet) show extremely limited recall when encountering facial attribute editing and face relighting. Most methods achieve moderate performance on face restoration and personalization. These results verify that current detectors exhibit limited abilities to generalize to unseen forgeries. Besides, it is worth noting that Effort achieves excellent performance on face relighting, greatly surpassing our method. We suppose the reason is that Effort freezes CLIP’s semantic encoder, allowing the model to focus solely on detecting whether an image has been manipulated, which is critical in detecting relighting where the identities and other semantics are largely unchanged.

The cross-domain results are shown in Table 13. Co-SPY achieves the best results among previous detectors. Most methods, including ours, show degraded performance. Specifically, previous methods fail on almost all cross-domain subsets, while our method still achieves robust performance on in-the-wild forgeries (e.g., 92.3% on Dreamina and 89.2% on GPT-4o). The performance on DeepFaceLab is extremely limited. Different from the cross-forgery and cross-model scenarios, the poor performance here is due to low Precision. A similar problem also exists in previous detectors. This means those unseen low-resolution real images are hard for the model to distinguish. Overall, our model achieves substantial improvements on cross-domain scenarios.

Besides, the zero-shot MLLMs tend to classify facial images as real photographs. Even GPT-4o fails in many cases, exhibiting extremely low recall (i.e., less than 10%). Gemini-2.5-Pro demonstrates strong capacities for deepfake detection, especially on high-resolution images, even beating most fine-tuned specialized detectors. Aggregating the above observations, we find an intuitive but interesting phenomenon: the MLLM-based detectors (especially reasoning MLLMs) are good at analyzing high-resolution images (typically over $512\times 512$ ) but fall short on low-resolution counterparts. Once the MLLMs can “see” the image details, they are able to make accurate judgments and provide a human-aligned reasoning process. Conversely, small vision models exhibit certain advantages on low-resolution images, as such data is more suited for distribution modeling. Therefore, a collaborative system of MLLMs and small models could be a promising future direction.

A.5.2 Cross Benchmark Comparison

In Table 14, we provide a cross benchmark comparison. We select the facial subsets from AIGIBench (Li et al., 2025). Note that these subsets are also unseen in our HydraFake training set, so they serve as an OOD test of our method. Since the real splits contain images of common objects, we substitute these images with facial images from VFHQ. To investigate the impact of training sources, we also train existing methods on FF++ (Rossler et al., 2019), similar to the previous one-to-many setting. The quantities of training samples for FF++ and HydraFake are kept consistent (both are $48$ K). Specifically: (1) Our HydraFake-CD is more challenging than AIGIBench, with the best result being 6.7% lower (82.2% vs. 88.9%) and the second best result showing a 10.0% decrease (74.7% vs. 84.7%). (2) On broader datasets, recent AIGC detection methods (e.g., Co-SPY and Effort) still demonstrate clear advantages over the methods tailored for deepfakes (e.g., IID and ProDet). This may indicate that specialized modules for facial images struggle to generalize to modern fully synthesized and high-fidelity deepfakes. On these data, concurrently modeling the semantics and artifacts (like Co-SPY and Effort) may be more effective. (3) Expanding the training data from FF++ to HydraFake brings promising gains for recent AIGC detection methods. For instance, D3 increases from 58.7% to 73.1% on HydraFake. However, similar gains are not observed for deepfake detection methods, e.g., both IID and ProDet suffer from a performance drop on HydraFake when trained on more diverse sources. This further reveals the gap between current deepfake detection methods and practical usage. When we have abundant training sources, the generalization performance does not scale up as expected. A comprehensive benchmark is necessary to measure the detectors’ capacities more practically. In Table 15, we conduct additional evaluations on extensive benchmarks, including the generic AIGC detection task. Notably, **Veritas** shows promising performance on these AIGC benchmarks, e.g., 72.1% on LOKI and 85.9% on FakeClue. Note that **Veritas** is only trained with facial forgery data. Moreover, **Veritas** generalizes well to the latest editing model (i.e., Nano-banana). We also provide reasoning cases in Figures 23, 24, 25, 26 and 27 to show the impressive adaptation capacities of Veritas.

A.5.3 Efficiency Comparison

In Table 16, we provide an efficiency analysis of our model. All the data are obtained on a single PPUE GPU with the original Transformers library implementation. We report the averaged inference time of a batch of images with the batch size set to $8$ . Specifically, we compare the efficiency of different reasoning paradigms of MLLMs. For post-hoc explanation and flexible reasoning models, we do not perform MiPO since the human-annotated data is hard to obtain. We present an experimental prototype here to provide an understanding of the inference efficiency of our model. Since inference efficiency is influenced by input resolution and task difficulty, we divide samples into four parts. Low-resolution images are $256\times 256$ and high-resolution images are $1024\times 1024$ . Specifically, (1) our model achieves faster inference on low-resolution images, while it becomes slower on high-resolution inputs. As discussed in Appendix A.5, when the model can perceive finer details, it can perform more thorough reasoning, leading to improved accuracy and longer inference time. (2) Compared to flexible reasoning models, **Veritas** incurs no significant increase in computational cost, yet achieves a 5.3% performance gain, demonstrating the effectiveness of pattern-aware reasoning. (3) Post-hoc explanation models exhibit little variation in efficiency across easy and hard samples, typically performing rigid, point-to-point analysis without adaptive reasoning. (4) Without P-GRPO to activate the “self-reflection” and “planning” mechanisms, our cold-start model achieves better efficiency while still maintaining competitive performance.

A.5.4 Effect of Training Data in P-GRPO Stage

In Table 19, we investigate the impact of training data in P-GRPO. For **Veritas**, we adopt balanced sampling among manipulation types, which achieves superior performance compared to randomly sampled data. Hard sampling, as commonly adopted for mathematical and coding problems (i.e., selecting the samples for which the model fails to reach all correct answers in $8$ rollouts), achieves inferior performance in our case, though it still yields improvements over the cold-start model. Moreover, we add about $1/3$ unseen data from AIGIBench into the P-GRPO stage. Note that these data are unseen during previous training stages, but do not overlap with the testing domain of HydraFake. As shown in Table 19, this yields promising improvements on cross-forgery scenarios (3.8% over our Veritas). From these observations, we point out that the cold-start model is a good policy model. While in this paper we only use in-domain data during P-GRPO for fair comparisons, users can flexibly add OOD data to elevate the detection ability, which can be achieved in two ways: (1) (a cheap and scalable way) adopt data with binary labels and our P-GRPO for training. (2) (a fine-grained and controllable way) use the cold-start model or **Veritas** to further construct a high-quality CoT dataset for customized deepfake data. This may require manual preference filtering but can further enhance the reasoning quality on target data.
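Balanced sampling among manipulation types can be sketched as follows. This is a minimal illustration under assumptions: the equal-quota rule, the function name and the type names in the toy pool are hypothetical, since the text does not specify how the balance is enforced.

```python
import random
from collections import Counter, defaultdict


def balanced_sample(items, n_total, key, seed=0):
    """Draw about n_total items with an equal share per group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    if not groups:
        return []
    per_group = n_total // len(groups)
    sampled = []
    for name in sorted(groups):
        pool = groups[name]
        # Take the whole group if it is smaller than its quota.
        sampled.extend(rng.sample(pool, min(per_group, len(pool))))
    rng.shuffle(sampled)
    return sampled


# Toy pool with an imbalanced mix of hypothetical manipulation types.
pool = (
    [{"id": i, "type": "face_swap"} for i in range(60)]
    + [{"id": i, "type": "reenactment"} for i in range(60, 90)]
    + [{"id": i, "type": "entire_synthesis"} for i in range(90, 100)]
)
subset = balanced_sample(pool, n_total=30, key=lambda x: x["type"])
type_counts = Counter(item["type"] for item in subset)
```

In the toy pool the raw mix is 60/30/10, whereas the sampled subset contains 10 items of each type.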

A.5.5 Analysis of Hyperparameters

Analysis of hyperparameter $\beta^{\prime}$ . $\beta^{\prime}$ controls the strength of penalty when the outputs deviate from the reference model. From Table 18, smaller $\beta^{\prime}$ yields better performance on cross-forgery and cross-domain sets, suggesting that stronger exploration helps activate advanced reasoning behaviors of the cold-start model, improving generalization on OOD data.

Analysis of hyperparameter $G$ (number of generated rollouts in P-GRPO). From Table 18, more generations within one group do not bring improvements while the training costs increase. We attribute this to the task gap between deepfake detection and common tasks. While mathematical problems often admit multiple valid solution paths, deepfake detection is a fact-based classification task with a more constrained reasoning space. In such cases, excessive group size leads to redundant exploration. This is also the reason we apply cold-start before the online RL stage, which ensures meaningful exploration during RL.
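To make the role of the group size concrete, the sketch below computes group-relative advantages as in the standard GRPO formulation, where rewards are standardized within a group of $G$ rollouts. It is not the P-GRPO reward itself, which is defined in the main text; the function name and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts (standard GRPO)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes gives a non-zero learning signal.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# A group in which every rollout gets the same reward gives none.
uniform = group_relative_advantages([1.0] * 8)
```

One possible reading of the "redundant exploration" observation is that when rollouts in a group mostly agree, their rewards are nearly identical and the advantages shrink toward zero, so extra rollouts add cost without adding signal.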

A.5.6 Effect of Different Reward Model

We investigate different reward models, including SophiaVL-R1-Thinking-Reward-Model-3B (Fan et al., 2025), Qwen2.5-VL-3B, UnifiedReward-Qwen-3B and UnifiedReward-Qwen-7B. From Figure 10, SophiaVL-R1-3B achieves degraded performance. This is because Sophia is specifically trained to measure the quality of CoT, while its instruction-following ability is limited, so it is not capable of measuring the novelty of the reflection content. In contrast, Qwen2.5-VL-3B and UnifiedReward-Qwen-3B can better distinguish the reflection quality. Further scaling up to 7B does not bring significant improvements.

A.6 Full Prompt Templates

In this section, we provide all the prompt templates used in our method. This includes the following parts:

• The prompts for pattern-aware SFT data annotation. For fake images, the process contains three stages as shown in Figure 28, Figure 29 and Figure 30. For real images, the process contains two stages (without the anomalies detection stage) as shown in Figure 31 and Figure 32.

• The prompts for reasoning quality evaluation, which contain score evaluation (Figure 33) and pairwise evaluation (Figure 34).

• The prompt for generating personalization prompts, as shown in Figure 35.

• The prompt for reflection quality reward model $\mathcal{M}$ , as shown in Figure 36.

• The system prompt for our **Veritas** model, as shown in Figure 37. The system prompts for all training stages and the inference stage are consistent.

• The prompt for zero-shot inference of MLLMs. We tested several prompts for each MLLM. First, we found that directly prompting them to perform pattern-aware reasoning like **Veritas** fails in most cases. Therefore, we perform common CoT instead. The prompts for Qwen2.5-VL-7B, InternVL3-8B and GLM-4.1V-9B-Thinking are provided in Figure 38. For MiMo-VL-7B, we found that providing prior information is harmful to the performance, so we adopt a simple system prompt instead, as shown in Figure 39. Similarly, for GPT-4o and Gemini-2.5-Pro we keep the default system prompt and only constrain the output format in the user prompt, as shown in Figure 40.

A.7 More Qualitative Results

We provide more examples of our model’s reasoning outputs. Specifically, our model can perform adaptive pattern-aware reasoning, generating direct and concise analysis for obviously fake images. It can also conduct thorough and holistic analysis for high-fidelity fake images. All the examples except FF++ are from OOD scenarios. It is worth noting that while being trained on pure in-domain facial data, **Veritas** exhibits promising AIGC analysis abilities, e.g., the infeasible date of birth on an ID card (Figure 16) and over-stylized texture of fabric (Figure 15). Such abilities mainly emerge from the combination of MiPO and P-GRPO. As mentioned in our main text, MiPO ensures high-quality rollouts in the subsequent stage, which enables more accurate policy updates for online RL. The effective exploration during RL facilitates deep reasoning capacities. Note that this observation differs from that of BusterX++ (Wen et al., 2025b), which found that cold-start constrains the output distribution and shrinks the OOD generalization abilities. We suppose the discrepancy arises because the rich semantics in AI-generated content allow MLLMs to succeed with pure RL, since they are proficient at capturing semantic-level clues and are capable of generating high-quality rollouts at the initial stage. However, the semantics of deepfake images are extremely limited, with most anomalies lying in low-level artifacts. In such cases, cold-start is necessary, and our work incorporates SFT and MiPO to instill human-aligned reasoning capacities into base models.

A.8 More Qualitative Comparisons with Existing MLLM-based detectors

We provide more reasoning comparisons between **Veritas** and existing MLLM-based detectors. Among the compared models, M2F2-Det and FFAA are specialized for deepfake detection, while other methods are generic forgery detection models. As shown in Figures 17, 18 and 19, M2F2-Det excels at performing faithful analyses within the facial region. However, it lacks consideration of deeper dimensions, resulting in suboptimal performance on certain fully synthesized data that require considerations about overall context. FFAA provides more detailed analyses. However, the logical coherence between the “description” and “reasoning” parts is weak, and the “reasoning” part lacks in-depth understanding. FakeShield falls short on analyzing fully synthesized facial data, but it demonstrates certain advantages in local artifact analysis (Figure 18 lower), since it is specifically trained for IMDL tasks. SIDA-13B-description provides generally high-quality explanations. However, it has a tendency to classify real facial images as fake. FakeVLM provides low-quality explanations regarding facial forgeries despite its high detection accuracy, e.g., most cases are explained as “The image exhibits underlying characteristic inconsistencies in its features that suggest it is artificially created”. Such vague and template-like explanations are likely due to its large-scale SFT training nature. In contrast, **Veritas** generates a holistic and faithful reasoning process.

A.9 Failure Analysis of Veritas

For real images, the failures mainly cluster on low-resolution data. As shown in Figure 22 upper, these data are generally of low quality, where unexpected artifacts such as localized blurriness affect the model’s judgement. For fake images, failures mainly occur on totally unseen forgery types such as face relighting. However, although the final answer is incorrect, **Veritas** still identifies suspicious clues, e.g., “overly uniform water droplets raise red flag” and “the warm lighting introduces uncertainty” in Figure 22. This provides valuable insights that could be used for further scrutiny or future improvements.

A.10 The Use of Large Language Models

We used LLMs for grammatical refinement and language polishing of the paper, aiming to improve the clarity and readability. Some MLLMs are used for the annotation of reasoning data, which is a common practice. Besides, the LLMs are not involved in research design or idea generation.

A.11 Ethics Statement

All real facial data used in this work are from publicly available academic datasets. The fake images include those from public benchmarks and those generated by our team using generative models or face-swapping techniques. The latter were created only from public or synthetic data, with no unauthorized use of personal images. Our work focuses on improving deepfake detection to combat misinformation, and all data are used strictly for non-commercial, academic purposes.

A.12 Limitations and Future Work

While HydraFake involves multi-level evaluations, it is limited to the image modality. With recent advances in video generation models, extracting frames from videos and detecting manipulations solely based on spatial artifacts is challenging. Moreover, as analyzed in Appendix A.5, our **Veritas** model still exhibits shortcomings on low-quality subsets such as DeepFaceLab and FFIW, as the reasoning requires more visual details. Therefore, we identify two future directions: (1) A collaborative system of MLLMs and small vision models, since the MLLM-based detectors (especially reasoning MLLMs) are good at analyzing high-resolution images while small vision models exhibit certain advantages on low-resolution counterparts. This has been explored by a recent work (Chen et al., 2024b), while how to develop a more adaptive or agent-like system is still an interesting problem. (2) A unified image-video deepfake benchmark. Recent video generation models are capable of creating high-fidelity talking faces and hand-face interactions (e.g., touching eyebrows or nose), posing new challenges to facial security systems. Due to high frame-level realism, traditional frame-based detectors often fail. Consequently, there is a growing need for unified detection frameworks capable of handling both image and video inputs, as well as rigorous benchmarks to facilitate the development of robust detection methods.

Bibliography

1. Adobe (2023) Adobe. Adobe Firefly, 2023. URL https://firefly.adobe.com/.
2. AI (2023) Starry AI. Starry ai, 2023. URL https://starryai.com/. Accessed: 2025-03-11.
3. Bai et al. (2025) Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.
4. Black Forest Labs (2024) Black Forest Labs. Flux v1.1 pro, 2024. URL https://replicate.com/black-forest-labs/flux-1.1-pro.
5. Bongini et al. (2025) Pietro Bongini, Sara Mandelli, Andrea Montibeller, Mirko Casu, Orazio Pontorno, Claudio Vittorio Ragaglia, Luca Zanchetta, Mattia Aquilina, Taiba Majid Wani, Luca Guarnera, et al. Wild: a new in-the-wild image linkage dataset for synthetic image attribution. arXiv preprint arXiv:2504.19595, 2025.
6. Chen et al. (2024a) Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark. In Forty-first International Conference on Machine Learning, 2024a.
7. Chen et al. (2025a) Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, and Yu Cheng. Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning. arXiv preprint arXiv:2506.04207, 2025a.
8. Chen et al. (2025b) Tao Chen, Jingyi Zhang, Decheng Liu, and Chunlei Peng. Mgffd-vlm: Multi-granularity prompt learning for face forgery detection with vlm. arXiv preprint arXiv:2507.12232, 2025b.