GPT4o-Receipt: A Dataset and Human Study for AI-Generated Document Forensics

Yan Zhang; Simiao Ren; Ankit Raj; En Wei; Dennis Ng; Alex Shen; Jiayu Xue; Yuxin Zhang; Evelyn Marotta

arXiv:2603.11442·cs.AI·March 26, 2026



TL;DR

This paper introduces GPT4o-Receipt, a benchmark dataset and human study of AI-generated receipt detection. Humans are better than advanced multimodal LLMs at spotting visual AI artifacts, yet worse at overall detection, because the dominant forensic signals are arithmetic errors that LLMs can verify and visual inspection cannot.

Contribution

It provides a new dataset and evaluation framework for AI document forensics, highlighting the paradoxical gap between human and machine detection capabilities and the importance of arithmetic verification signals.

Findings

01 Humans outperform LLMs in visual detection of AI artifacts.

02 LLMs are faster and more accurate at verifying arithmetic correctness.

03 Detection accuracy varies significantly across different models and evaluation metrics.

Abstract

Can humans detect AI-generated financial documents better than machines? We present GPT4o-Receipt, a benchmark of 1,235 receipt images pairing GPT-4o-generated receipts with authentic ones from established datasets, evaluated by five state-of-the-art multimodal LLMs and a 30-annotator crowdsourced perceptual study. Our findings reveal a striking paradox: humans are better at seeing AI artifacts, yet worse at detecting AI documents. Human annotators exhibit the largest visual discrimination gap of any evaluator, yet their binary detection F1 falls well below Claude Sonnet 4 and below Gemini 2.5 Flash. This paradox resolves once the mechanism is understood: the dominant forensic signals in AI-generated receipts are arithmetic errors -- invisible to visual inspection but systematically verifiable by LLMs. Humans cannot perceive that a subtotal is incorrect; LLMs verify it in milliseconds.…
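The arithmetic signal described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline or any evaluated model's method; it assumes the line items, subtotal, tax, and total have already been extracted from the receipt as strings, and the function name `check_receipt` and its parameters are hypothetical.

```python
from decimal import Decimal

def check_receipt(items, subtotal, tax, total, tolerance=Decimal("0.01")):
    """Return a list of arithmetic inconsistencies found on a receipt.

    Hypothetical helper for illustration only.
    items: list of (quantity, unit_price) pairs, prices given as strings.
    subtotal, tax, total: amounts printed on the receipt, as strings.
    """
    problems = []
    subtotal, tax, total = Decimal(subtotal), Decimal(tax), Decimal(total)

    # Recompute the subtotal from the line items using exact decimal arithmetic.
    computed_subtotal = sum(
        (Decimal(qty) * Decimal(price) for qty, price in items),
        Decimal("0"),
    )
    if abs(computed_subtotal - subtotal) > tolerance:
        problems.append(
            f"subtotal mismatch: printed {subtotal}, computed {computed_subtotal}"
        )

    # Check that the printed total equals the printed subtotal plus tax.
    if abs(subtotal + tax - total) > tolerance:
        problems.append(
            f"total mismatch: printed {total}, computed {subtotal + tax}"
        )
    return problems

items = [(2, "3.50"), (1, "4.25")]  # line items sum to 11.25
consistent = check_receipt(items, "11.25", "0.90", "12.15")
suspicious = check_receipt(items, "11.75", "0.90", "12.65")
print(consistent)
print(suspicious)
```

In the second call the printed total agrees with the printed subtotal plus tax, so the receipt looks internally plausible at a glance; only recomputing the subtotal from the line items exposes the inconsistency. This is the kind of error a reader is unlikely to notice visually but that is mechanical to verify.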


Taxonomy

Topics: Digital Media Forensic Detection · Digital and Cyber Forensics · Benford’s Law and Fraud Detection