Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos

Yuqi Tang; Yang Shi; Zhuoran Zhang; Qixun Wang; Xuehai Bai; Yue Ding; Ruizhe Chen; Bohan Zeng; Xinlong Chen; Xuanyu Zhu; Bozhou Li; Yuran Wang; Yifan Dai; Chengzhuo Tong; Xinyu Liu; Yiyan Ji; Yujie Wei; Yuhao Dong; Shilin Yan; Fengxiang Wang; Yi-Fan Zhang; Haotian Wang; Yuanxing Zhang; Pengfei Wan

arXiv:2605.18984 · cs.CV · May 20, 2026

TL;DR

Artifact-Bench is a new benchmark designed to evaluate Multimodal Large Language Models (MLLMs) on their ability to detect, compare, and analyze artifacts in AI-generated videos across various styles.

Contribution

The paper introduces Artifact-Bench, a benchmark that pairs a three-level hierarchical taxonomy of realism artifacts with evaluation tasks for assessing MLLMs' artifact perception across photorealistic, animated, and CG-style AI-generated videos.

Findings

01. Many MLLMs perform near chance level on artifact detection tasks.

02. MLLM judgments are significantly misaligned with human preferences.

03. Current MLLMs are of limited reliability as evaluators of AI-generated video realism.
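
One way to make "near random" concrete is to check whether a model's accuracy is statistically distinguishable from chance. The sketch below is a minimal illustration and not the paper's evaluation protocol: it assumes a multiple-choice task with equally likely options, and the function names (`binomial_tail`, `near_random`), the 0.05 threshold, and the example counts are all hypothetical.

```python
from math import comb


def binomial_tail(n_correct: int, n_total: int, p_chance: float) -> float:
    """Return P(X >= n_correct) for X ~ Binomial(n_total, p_chance)."""
    return sum(
        comb(n_total, k) * p_chance**k * (1 - p_chance) ** (n_total - k)
        for k in range(n_correct, n_total + 1)
    )


def near_random(n_correct: int, n_total: int,
                n_choices: int = 2, alpha: float = 0.05) -> bool:
    """True if the score is NOT significantly above chance.

    Assumes each of the n_choices options is equally likely under guessing
    (an assumption of this sketch, not a detail taken from the paper).
    """
    p_chance = 1.0 / n_choices
    return binomial_tail(n_correct, n_total, p_chance) >= alpha


# Hypothetical counts for a binary real-vs-artifact question set.
print(near_random(52, 100))  # True: 52% is indistinguishable from guessing
print(near_random(70, 100))  # False: 70% is well above chance
```

A one-sided exact binomial test is used here because the question is only whether a model beats guessing; a different task format (for example, pairwise comparison or free-form analysis) would need a different baseline.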

Abstract

Recent video generative models have greatly improved the realism of AI-generated videos, yet their outputs still exhibit artifacts such as temporal inconsistencies, structural distortions, and semantic incoherence. While Multimodal Large Language Models (MLLMs) show strong visual understanding capabilities, their ability to perceive and reason about such artifacts remains unclear. Existing benchmarks often lack systematic evaluation of artifact-aware perception and fine-grained diagnostic reasoning, especially across diverse AI-generated video domains beyond photorealistic content. To address this gap, we introduce Artifact-Bench, a comprehensive benchmark for evaluating MLLMs on AI-generated video artifact detection and analysis. We first establish a three-level hierarchical taxonomy of realism artifacts, covering photorealistic, animated, and CG-style videos. Based on this taxonomy,…

Code & Models

Repositories

frankyang-17/Artifact-Bench
github

Datasets

DogNeverSleep/Artifact-Bench
dataset · 1.6k downloads
