AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images

Bo Zhang; Tzu-Yen Ma; Zichen Tang; Junpeng Ding; Zirui Wang; Yizhuo Zhao; Peilin Gao; Zijie Xi; Zixin Ding; Haiyang Sun; Haocheng Gao; Yuan Liu; Liangjia Wang; Yiling Huang; Yujie Wang; Yuyue Zhang; Ronghui Xi; Yuanze Li; Jiacheng Liu; Zhongjun Yang; Haihong E

arXiv:2604.28177 · cs.CV · May 22, 2026

TL;DR

AEGIS is a comprehensive benchmark that evaluates the forensic analysis of AI-generated academic images across multiple dimensions, revealing current limitations and strengths in detection, localization, and reasoning.

Contribution

It introduces a benchmark spanning seven academic image categories with 39 fine-grained subtypes, four academic forgery strategies simulated across 25 generative models, and a joint evaluation of detection, reasoning, and localization.

Findings

01

Even GPT-5.1 reaches only 48.80% overall performance on AEGIS.

02

11 of the 25 generative models yield average forensic accuracy below 50%.

03

Multimodal large language models achieve 84.74% accuracy in textual artifact recognition.

Abstract

We introduce AEGIS, A holistic benchmark for Evaluating forensic analysis of AI-Generated academic ImageS. Compared to existing benchmarks, AEGIS features three key advances: (1) Domain-Specific Complexity: covering seven academic categories with 39 fine-grained subtypes, exposing intrinsic forensic difficulty, where even GPT-5.1 reaches 48.80% overall performance and expert models achieve only limited localization accuracy (IoU 30.09%); (2) Diverse Forgery Simulations: modeling four prevalent academic forgery strategies across 25 generative models, with 11 yielding average forensic accuracy below 50%, showing that forensics lag behind generative advances; and (3) Multi-Dimensional Forensic Evaluation: jointly assessing detection, reasoning, and localization, revealing complementary strengths between model families, with multimodal large language models (MLLMs) at 84.74% accuracy in…
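The abstract reports localization accuracy as IoU (intersection over union). The exact localization protocol is not described on this page, so the following is only a minimal sketch of the standard IoU definition on binary masks, assuming the predicted and ground-truth forged regions are given as same-shaped grids of 0/1 values. The function name mask_iou and the toy masks are hypothetical and are not taken from the paper.

```python
def mask_iou(pred, truth):
    """Intersection over union of two same-shaped binary masks.

    Masks are nested lists of 0/1 values (an assumption for this sketch;
    the benchmark's actual mask format is not stated on this page).
    """
    if len(pred) != len(truth) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, truth)
    ):
        raise ValueError("masks must have the same shape")

    intersection = 0
    union = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                intersection += 1  # pixel flagged in both masks
            if p or t:
                union += 1  # pixel flagged in at least one mask

    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if union == 0 else intersection / union


# Toy 3x3 example: 3 pixels overlap, 5 pixels are flagged in total.
pred = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]
truth = [[0, 0, 1],
         [0, 1, 1],
         [0, 1, 0]]
iou = mask_iou(pred, truth)
print(f"IoU = {iou:.2f}")
```

Under this definition, a score near 0.30 means that the overlap between the predicted and true forged regions covers roughly 30% of their combined area.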
