Can Multi-modal (reasoning) LLMs detect document manipulation?

Zisheng Liang; Kidus Zewde; Rudra Pratap Singh; Disha Patil; Zexi Chen; Jiayu Xue; Yao Yao; Yifei Chen; Qinzhe Liu; Simiao Ren

arXiv:2508.11021·cs.CV·August 18, 2025

TL;DR

This paper evaluates the effectiveness of various state-of-the-art multi-modal large language models in detecting document fraud, highlighting their strengths, limitations, and the importance of task-specific fine-tuning.

Contribution

The paper benchmarks multiple multi-modal LLMs on document fraud detection, assessing their zero-shot capabilities and finding that model size has limited impact on accuracy.

Findings

01. Top models outperform traditional methods on out-of-distribution data

02. Vision LLMs show inconsistent performance

03. Fine-tuning is crucial for task-specific accuracy

Abstract

Document fraud poses a significant threat to industries reliant on secure and verifiable documentation, necessitating robust detection mechanisms. This study investigates the efficacy of state-of-the-art multi-modal large language models (LLMs), including OpenAI O1, OpenAI 4o, Gemini Flash (thinking), Deepseek Janus, Grok, Llama 3.2 and 4, Qwen 2 and 2.5 VL, Mistral Pixtral, and Claude 3.5 and 3.7 Sonnet, in detecting fraudulent documents. We benchmark these models against each other and against prior document fraud detection techniques, using a standard dataset of real transactional documents. Through prompt optimization and detailed analysis of the models' reasoning processes, we evaluate their ability to identify subtle indicators of fraud, such as tampered text, misaligned formatting, and inconsistent transactional sums. Our results reveal that top-performing multi-modal LLMs…
