From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering

Xinyi Shang; Yi Tang; Jiacheng Cui; Ahmed Elhagry; Salwa K. Al Khatib; Sondos Mahmoud Bsharat; Jiacheng Liu; Xiaohan Zhao; Jing-Hao Xue; Hao Li; Salman Khan; Zhiqiang Shen

arXiv:2603.20193·cs.CV·March 23, 2026

TL;DR

This paper introduces a pixel-level, meaning-aware framework for image tampering detection that improves upon mask-based benchmarks by providing detailed localization, semantic understanding, and natural language descriptions of edits.

Contribution

It presents a new taxonomy, a benchmark with pixel-level tamper maps, and evaluation metrics that incorporate semantics and language understanding for tampering detection.

Findings

01. Existing mask-based metrics can misjudge tampering severity.

02. The new benchmark reveals limitations of current detectors on micro-edits.

03. The proposed framework improves localization and semantic classification of tampered regions.

Abstract

Existing tampering detection benchmarks largely rely on object masks, which severely misalign with the true edit signal: many pixels inside a mask are untouched or only trivially modified, while subtle yet consequential edits outside the mask are treated as natural. We reformulate VLM image tampering from coarse region labels to a pixel-grounded, meaning- and language-aware task. First, we introduce a taxonomy spanning edit primitives (replace/remove/splice/inpaint/attribute/colorization, etc.) and the semantic class of the tampered object, linking low-level changes to high-level understanding. Second, we release a new benchmark with per-pixel tamper maps and paired category supervision to evaluate detection and classification within a unified protocol. Third, we propose a training framework and evaluation metrics that quantify pixel-level correctness with localization to assess confidence…
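To make the mask-versus-pixel misalignment described in the abstract concrete, the following is a minimal sketch on a toy grayscale image. It is not the paper's benchmark or metric: the function names `tamper_map` and `mask_alignment`, the difference threshold, and the 4x4 example images are all illustrative assumptions. It derives a per-pixel tamper map from an original/edited pair and counts how many mask pixels were actually untouched and how many changed pixels fall outside the mask.

```python
def tamper_map(original, edited, threshold=0):
    """Binary per-pixel map: 1 where the absolute intensity change exceeds
    `threshold` (a hypothetical parameter, not taken from the paper)."""
    return [
        [1 if abs(o - e) > threshold else 0 for o, e in zip(row_o, row_e)]
        for row_o, row_e in zip(original, edited)
    ]


def mask_alignment(mask, tmap):
    """Compare a coarse object mask against a per-pixel tamper map."""
    inside_changed = inside_untouched = outside_changed = 0
    for mask_row, tmap_row in zip(mask, tmap):
        for m, t in zip(mask_row, tmap_row):
            if m and t:
                inside_changed += 1      # mask and true edit agree
            elif m:
                inside_untouched += 1    # labeled tampered, but unchanged
            elif t:
                outside_changed += 1     # real edit the mask calls natural
    union = inside_changed + inside_untouched + outside_changed
    return {
        "inside_changed": inside_changed,
        "inside_untouched": inside_untouched,
        "outside_changed": outside_changed,
        "iou": inside_changed / union if union else 1.0,
    }


# Toy 4x4 grayscale images (assumed data for illustration only).
original = [
    [10, 10, 10, 10],
    [10, 50, 50, 10],
    [10, 50, 50, 10],
    [10, 10, 10, 10],
]
edited = [
    [10, 10, 10, 12],  # subtle change outside the object mask
    [10, 90, 50, 10],  # only one pixel inside the mask is modified
    [10, 50, 50, 10],
    [10, 10, 10, 10],
]
object_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

stats = mask_alignment(object_mask, tamper_map(original, edited))
print(stats)
```

In this toy case three of the four mask pixels are untouched and one real edit lies outside the mask, which is the kind of disagreement between coarse region labels and the true edit signal that the abstract points to.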

Taxonomy

Topics: Digital Media Forensic Detection · Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques