EVADE-Bench: Multimodal Benchmark for Evasive Content Detection in E-Commerce Applications
Ancheng Xu, Zhihao Yang, Jingpeng Li, Guanghu Yuan, Longze Chen, Liang Yan, Jiehui Zhou, Zhen Qin, Hengyu Chang, Hamid Alinejad-Rokny, Min Yang

TL;DR
EVADE-Bench introduces a comprehensive multimodal benchmark dataset designed to evaluate and improve the ability of foundation models to detect evasive, misleading content in e-commerce, addressing a critical challenge in online content moderation.
Contribution
This paper presents the first expert-curated, multimodal benchmark specifically targeting evasive content detection in e-commerce, including new tasks, datasets, and baseline evaluations for foundation models.
Findings
State-of-the-art models often misclassify evasive samples.
Clearer rule definitions improve model alignment and reasoning.
Benchmark exposes fundamental limitations in current multimodal models.
Abstract
E-commerce platforms increasingly rely on Large Language Models (LLMs) and Vision-Language Models (VLMs) to detect illicit or misleading product content. However, these models remain vulnerable to evasive content: inputs (text or images) that superficially comply with platform policies while covertly conveying prohibited claims. Unlike traditional adversarial attacks that induce overt failures, evasive content exploits ambiguity and context, making it far harder to detect. Existing robustness benchmarks provide little guidance for this demanding, real-world challenge. We introduce EVADE, the first expert-curated, Chinese, multimodal benchmark specifically designed to evaluate foundation models on evasive content detection in e-commerce. The dataset contains 2,833 annotated text samples and 13,961 images spanning six demanding product categories, including body shaping, height growth,…
Peer Reviews
Decision · Submitted to ICLR 2026
1. The paper is well written and easy to follow. 2. The task of evasive content detection is meaningful for many real-world scenarios. The proposed benchmark offers significant value for improving the relevant techniques. 3. The authors perform a series of evaluations with existing LLMs and VLMs. The results and error analysis shed light on further development.
1. The authors do not report human annotation quality, so it is unknown how hard this task is even for a human. A fine-grained breakdown of the dataset's category distribution should also be reported.
1. According to the authors, this work presents the first Chinese multimodal benchmark for evasive content detection in e-commerce scenarios. The study holds strong practical significance in detecting potentially deceptive or policy-violating advertisements and demonstrates both social value and pioneering potential in improving content safety and regulation in online commerce. 2. The authors employ a data construction pipeline that integrates human experts and large language models (LLMs), e
1. The differences between the two task paradigms, Single-Violation and All-in-One, are not sufficiently explained. 2. The paper lacks an evaluation of annotation quality. Given that the task is a complex multi-class classification problem, inter-annotator agreement among experts is essential to ensure data reliability. 3. In the All-in-One task, the authors introduce two subtasks, “simplified description” and “detailed description,” but the motivation behind this design and the performanc
1. Addresses a real-world application challenge by focusing on evasive content detection. 2. Constructs a large-scale, expert-annotated multimodal dataset (covering text and image inputs) with rigorous annotation pipelines, ensuring high-quality and regulation-aligned ground truth. 3. Conducts extensive experiments on 26 LLMs/VLMs (both open-source and closed-source), offering systematic baselines and detailed error analysis that sheds light on model limitations in multimodal reasoning. 4. Vali
1. Unclear definition of Evasive Content Detection (ECD): The paper fails to provide a rigorous, cited definition of ECD, leaving ambiguity about its scope and distinguishing features. It also does not adequately justify why ECD is a unique and critical task for LLMs/VLMs, nor does it clarify how ECD differs from existing QA or adversarial detection tasks. 2. Narrow focus on domain-specific scenarios without generalizable capability assessment: The benchmark is limited to six e-commerce product
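Several of the reviews above ask for inter-annotator agreement, which the paper reportedly does not provide. As a rough illustration of what such a measurement involves, here is a minimal sketch of Cohen's kappa for two annotators. The function name, the label strings, and the example labels are all hypothetical and are not taken from EVADE or from the authors' pipeline.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label for every item.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for six samples (illustration only, not EVADE data).
annotator_1 = ["violation", "violation", "compliant",
               "compliant", "violation", "compliant"]
annotator_2 = ["violation", "compliant", "compliant",
               "compliant", "violation", "violation"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so it is more informative than plain percent agreement when one label dominates. With more than two annotators, a multi-rater statistic such as Fleiss' kappa would be the usual substitute.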
