When the Forger Is the Judge: GPT-Image-2 Cannot Recognize Its Own Faked Documents
Jiaqi Wu, Yuchen Zhou, Dennis Tsang Ng, Xingyu Shen, Kidus Zewde, Ankit Raj, Tommy Duong, Simiao Ren

TL;DR
This paper shows that human inspectors cannot distinguish GPT-Image-2 document forgeries from authentic images, and benchmarks human and computational detectors against those forgeries.
Contribution
It releases AIForge-Doc v2, a paired dataset of 3,066 GPT-Image-2 document forgeries with pixel-precise masks, and benchmarks four lines of defence: human inspectors, TruFor, DocTamper, and GPT-Image-2 itself as a zero-shot self-judge.
Findings
Humans perform at chance level in detecting forgeries (2AFC accuracy 0.501).
Computational detectors only modestly outperform chance.
Detection accuracy drops significantly when identifying GPT-Image-2 inpainting.
Abstract
OpenAI's GPT-Image-2 has effectively erased the visual boundary between authentic and AI-edited document images: a single number on a receipt can be replaced in under a second for a few cents. We release AIForge-Doc v2, a paired dataset of 3,066 GPT-Image-2 document forgeries with pixel-precise masks in DocTamper-compatible format, and benchmark four lines of defence: human inspectors (N=120, n=365 pair-votes via the public 2AFC site CanUSpotAI.com), TruFor (generic forensic), DocTamper (qcf-568, document-specific), and the same GPT-Image-2 model as a zero-shot self-judge -- asked, to avoid the trivial "image is mostly real" reading, whether any region was generated or edited by an AI image model. Human 2AFC accuracy is 0.501, indistinguishable from chance: even side-by-side, inspectors cannot tell GPT-Image-2 receipt forgeries from authentic counterparts. The three computational judges…
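As a rough check on the "indistinguishable from chance" claim, the sketch below tests the reported human 2AFC accuracy against a 50% guessing rate. It rests on two assumptions that are not stated in the abstract: that 0.501 over 365 pair-votes corresponds to 183 correct votes (obtained by rounding), and that the votes can be treated as independent even though they come from 120 inspectors. The function names are illustrative and this is not the authors' analysis.

```python
from math import comb, sqrt


def binom_two_sided_p(k: int, n: int) -> float:
    """Exact two-sided binomial p-value for k successes in n trials under p = 0.5."""
    # With p = 0.5 the distribution is symmetric around n / 2, so we sum the
    # probability of every outcome at least as far from n / 2 as k is.
    dist = abs(k - n / 2)
    tail = sum(comb(n, i) for i in range(n + 1) if abs(i - n / 2) >= dist)
    return tail / 2 ** n


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    p_hat = k / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


n_votes = 365                        # pair-votes reported in the abstract
n_correct = round(0.501 * n_votes)   # assumption: inferred from accuracy 0.501

p_value = binom_two_sided_p(n_correct, n_votes)
low, high = wilson_interval(n_correct, n_votes)

print(f"correct votes (assumed): {n_correct} / {n_votes}")
print(f"two-sided p-value vs. chance: {p_value:.3f}")
print(f"95% Wilson interval: [{low:.3f}, {high:.3f}]")
```

Under these assumptions the interval contains 0.5 and the test gives no evidence against guessing, which is consistent with the abstract's reading. If votes from the same inspector are correlated, the true interval would be wider, not narrower.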
