RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation
Zhifeng Lu, Dianyuan Wang, Yuhu Shang, and Zhenbo Xu

TL;DR
RuleSafe-VL is a new benchmark designed to evaluate how well vision-language models understand and apply complex moderation rules, moving beyond simple label matching to assess decision reasoning in content moderation.
Contribution
RuleSafe-VL introduces a formalized set of moderation rules and diagnostic tasks for evaluating rule-conditioned decision reasoning in vision-language models.
Findings
The best model reaches only 64.8 Macro-F1 on rule-relation recovery.
Safety-oriented models score below 7 Macro-F1.
Decision-state prediction peaks at 64.5 Macro-F1.
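All three findings are reported in Macro-F1, which averages per-class F1 with equal weight per class, so a model cannot score well by favoring the most frequent label. The sketch below shows the standard computation on a 0-100 scale. The label names (`allow`, `restrict`, `remove`) and the toy predictions are illustrative assumptions based on the outcomes named in the abstract; they are not the benchmark's actual label set or data, and the paper's exact scoring script may differ.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, scaled to 0-100.

    Classes are taken from the union of gold and predicted labels.
    A class with no true positives, false positives, or false
    negatives contributes 0.0 (a common convention, assumed here).
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted class gains a false positive
            fn[gold] += 1  # gold class gains a false negative

    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN)
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_class.append(2 * tp[label] / denom if denom else 0.0)
    return 100.0 * sum(per_class) / len(per_class)


# Hypothetical, imbalanced toy data: 8 "allow", 1 "restrict", 1 "remove".
gold = ["allow"] * 8 + ["restrict", "remove"]
always_allow = ["allow"] * 10

accuracy = 100.0 * sum(g == p for g, p in zip(gold, always_allow)) / len(gold)
print(f"accuracy = {accuracy:.1f}")                        # accuracy = 80.0
print(f"macro-F1 = {macro_f1(gold, always_allow):.2f}")    # macro-F1 = 29.63
```

In this toy case a predictor that always answers `allow` is 80% accurate but scores under 30 Macro-F1, because the two minority classes each contribute an F1 of zero.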
Abstract
Platform content moderation applies explicit policy rules and context-dependent conditions to decide whether user content is allowed, restricted, or removed. A correct moderation outcome must therefore depend on which rules a case activates, how those rules interact, and whether the available evidence is sufficient. Current multimodal safety benchmarks largely reduce moderation to matching predefined final labels, leaving this underlying rule structure untested. As a result, a high benchmark score reveals little about whether a model applies the policy correctly or arrives at the correct label through superficial cues. To evaluate this rule-governed process, we introduce RuleSafe-VL, a benchmark for rule-conditioned decision reasoning in vision-language content moderation. Derived from publicly available platform moderation policies, RuleSafe-VL formalizes 93 atomic rules and 92 typed…
