TL;DR
DocShield is a unified, evidence-grounded framework that detects, localizes, and explains text-centric document forgeries through visual-logical co-reasoning.
Contribution
The paper presents what its authors describe as the first unified framework that combines visual and textual reasoning for text-centric forgery analysis, along with a new multilingual dataset and a code release.
Findings
Outperforms existing methods with a 41.4% increase in macro F1 score.
Achieves 23.4% higher macro F1 on T-IC13 compared to GPT-4o.
Demonstrates evidence-grounded forensic analysis through its Cross-Cues-aware Chain of Thought (CCT) mechanism, which cross-validates visual anomalies against textual semantics.
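For readers unfamiliar with the metric behind these numbers, the sketch below shows how macro F1 is commonly computed: F1 is calculated per class and then averaged with equal weight, so a rare "forged" class counts as much as a common "real" class. The label names and toy data are illustrative assumptions, not taken from the paper, and this summary does not say whether the reported gains are absolute points or relative improvements.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, for illustration only.
y_true = ["real", "real", "real", "forged"]
y_pred = ["real", "real", "forged", "forged"]
print(round(macro_f1(y_true, y_pred), 4))  # 0.7333
```

Here the "real" class scores 0.8 and the "forged" class scores about 0.667, so the macro average is about 0.733.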
Abstract
The rapid progress of generative AI has enabled increasingly realistic text-centric image forgeries, posing major challenges to document safety. Existing forensic methods mainly rely on visual cues and lack evidence-based reasoning to reveal subtle text manipulations. Detection, localization, and explanation are often treated as isolated tasks, limiting reliability and interpretability. To tackle these challenges, we propose DocShield, the first unified framework formulating text-centric forgery analysis as a visual-logical co-reasoning problem. At its core, a novel Cross-Cues-aware Chain of Thought (CCT) mechanism enables implicit agentic reasoning, iteratively cross-validating visual anomalies with textual semantics to produce consistent, evidence-grounded forensic analysis. We further introduce a Weighted Multi-Task Reward for GRPO-based optimization, aligning reasoning structure,…
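The abstract is cut off before it lists the reward terms, so the following is only a generic sketch of how a weighted multi-task reward can feed GRPO-style group-relative advantages. The component names (`format`, `detection`, `localization`), the weights, and the use of population standard deviation are all hypothetical assumptions for illustration; they are not the paper's actual reward design.

```python
from statistics import mean, pstdev


def weighted_reward(components, weights):
    """Combine per-task reward terms into one scalar (hypothetical task names)."""
    return sum(weights[name] * components[name] for name in weights)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled response's reward against its own group.

    GRPO-style methods score several responses to the same prompt and use
    (reward - group mean) / group std as the advantage, with no value network.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical weights and per-response scores, for illustration only.
weights = {"format": 0.2, "detection": 0.5, "localization": 0.3}
group = [
    {"format": 1.0, "detection": 1.0, "localization": 0.0},
    {"format": 1.0, "detection": 0.0, "localization": 0.0},
    {"format": 1.0, "detection": 1.0, "localization": 1.0},
]
rewards = [weighted_reward(c, weights) for c in group]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's mean receive positive advantages and are reinforced; those below the mean are discouraged.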