THEMIS: Towards Holistic Evaluation of MLLMs for Scientific Paper Fraud Forensics

Tzu-Yen Ma; Bo Zhang; Zichen Tang; Junpeng Ding; Haolin Tian; Yuanze Li; Zhuodi Hao; Zixin Ding; Zirui Wang; Xinyu Yu; Shiyao Peng; Yizhuo Zhao; Ruomeng Jiang; Yiling Huang; Peizhi Zhao; Jiayuan Chen; Weisheng Tan; Haocheng Gao; Yang Liu; Jiacheng Liu; Zhongjun Yang; Jiayu Huang; Haihong E

arXiv:2603.25089·cs.CV·March 27, 2026

Open Access · 3 Reviews

TL;DR

THEMIS is a comprehensive benchmark designed to evaluate multimodal large language models on complex visual fraud reasoning in real-world academic scenarios, highlighting current model limitations.

Contribution

The paper introduces THEMIS, a novel multi-task benchmark with diverse scenarios, fraud types, and fine-grained manipulations for rigorous evaluation of MLLMs.

Findings

01 · Even the best model, GPT-5, scores only 56.15% overall.

02 · The benchmark comprises over 4,000 questions derived from authentic retracted-paper cases and curated synthetic data.

03 · Models struggle with complex textures and multiple stacked manipulations.

Abstract

We present THEMIS, a novel multi-task benchmark designed to comprehensively evaluate multimodal large language models (MLLMs) on visual fraud reasoning within real-world academic scenarios. Compared to existing benchmarks, THEMIS introduces three major advances. (1) Real-World Scenarios and Complexity: Our benchmark comprises over 4,000 questions spanning seven scenarios, derived from authentic retracted-paper cases and carefully curated multimodal synthetic data. With 60.47% complex-texture images, THEMIS bridges the critical gap between existing benchmarks and the complexity of real-world academic fraud. (2) Fraud-Type Diversity and Granularity: THEMIS systematically covers five challenging fraud types and introduces 16 fine-grained manipulation operations. On average, each sample undergoes multiple stacked manipulation operations, with the diversity and difficulty of these…
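To make the phrase "multiple stacked manipulation operations" concrete, the following is a minimal sketch of how several manipulations might be applied in sequence to a single image. It is an illustration only: the three operations (`copy_move`, `brighten`, `flip_horizontal`), the `apply_stack` helper, and the nested-list image representation are all hypothetical stand-ins, and none of them is taken from the THEMIS construction pipeline or its list of 16 operations.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a real figure: rows of grayscale pixels in 0..255.
Image = List[List[int]]
Operation = Tuple[str, Callable[[Image], Image]]


def copy_move(img: Image, src_col: int, dst_col: int, width: int) -> Image:
    """Duplicate a vertical strip of the image at another column offset."""
    n_cols = len(img[0])
    if src_col + width > n_cols or dst_col + width > n_cols:
        raise ValueError("strip does not fit inside the image")
    out = [row[:] for row in img]  # copy so the input is left untouched
    for row in out:
        row[dst_col:dst_col + width] = row[src_col:src_col + width]
    return out


def brighten(img: Image, delta: int) -> Image:
    """Shift every pixel by delta, clamping to the valid 0..255 range."""
    return [[min(255, max(0, p + delta)) for p in row] for row in img]


def flip_horizontal(img: Image) -> Image:
    """Mirror each row left to right."""
    return [row[::-1] for row in img]


def apply_stack(img: Image, ops: List[Operation]) -> Tuple[Image, List[str]]:
    """Apply the operations in order and record which ones were used."""
    applied = []
    for name, op in ops:
        img = op(img)
        applied.append(name)
    return img, applied
```

Recording the names of the applied operations alongside the output is one plausible way to keep ground-truth labels for a sample that has been manipulated more than once; whether THEMIS stores its labels this way is not stated in the abstract.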

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The constructed benchmark covers a wide range of academic disciplines and forgery types, and the evaluation includes a relatively comprehensive set of model categories.
2. Although there are a few typos, the paper is overall well-written, and the figures and visual presentations are clear and well-designed.

Weaknesses

1. The authors mention using the **Fitz** library and **YOLOv7** for information extraction and segmentation. However, to my knowledge, there are now more accurate tools in the document extraction domain, such as **dots.ocr** and **MinerU**. Given the diversity of samples, the effectiveness of simply applying Fitz (PyMuPDF) and YOLOv7 is questionable. Since subsequent steps rely heavily on accurate information extraction, this stage could be improved to ensure higher benchmark quality and reliab

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. THEMIS defines a comprehensive taxonomy for the field of scientific paper fraud forensics. Specifically, THEMIS covers 7 academic scenarios, 5 tasks, 16 manipulation operations, and 5 core reasoning capabilities, which is more diverse than existing visual fraud reasoning benchmarks.
2. The data quality of THEMIS is high. The synthetic data is rigorously reviewed by human experts. Moreover, THEMIS contains real samples in addition to synthetic samples, which makes it closer to real-world appl

Weaknesses

1. The set of MLLMs evaluated in this paper is not comprehensive enough. It is recommended to add results for InternVL3.5, GLM4.5V, Gemini-2.5-Pro, Claude, etc.
2. There is no comparison across different parameter sizes within the same series of MLLMs (e.g., Qwen2.5-VL-3B/7B/32B/72B).
3. The conclusion in lines 373-375 is not well explained.

Reviewer 03 · Rating 8 · Confidence 3

Strengths

* The starting point is novel and has practical application value.
* The data construction is complete, clear and reproducible.

Weaknesses

* Typo: There is an issue with the citation in 4.4 Appendix.
* As a benchmark, the work is quite well done. In the long run, however, this topic seems better suited to optimizing, training, and fine-tuning models. If fine-tuning results were included, the benchmark would be even more valuable.

Taxonomy

Topics: Authorship Attribution and Profiling · Handwritten Text Recognition Techniques · Digital Media Forensic Detection