QwenSafe: Multimodal Content Rating Description Identification via Preference-Aligned VLMs

Dishanika Denipitiyage; Aruna Seneviratne; Suranga Seneviratne

arXiv:2605.20584·cs.CV·May 21, 2026

TL;DR

QwenSafe is a multimodal vision-language model that automatically identifies content rating descriptors in mobile apps by reasoning jointly over app metadata and screenshots, with the aim of making these disclosures more accurate and consistent.

Contribution

The paper introduces QwenSafe and a data pipeline, metadata2CRD, for scalable training, and demonstrates superior performance over state-of-the-art models in content rating classification.

Findings

01. QwenSafe outperforms baselines in binary content rating descriptor (CRD) classification.

02. The model achieves a 111.8% improvement in positive-class recall.

03. Descriptor-aware multimodal alignment enhances content classification accuracy.
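
For readers unfamiliar with the metric in the second finding, positive-class recall is TP / (TP + FN), and a relative improvement above 100% means recall more than doubled against the comparison point. The sketch below only illustrates that arithmetic. The labels are made up, the function names are hypothetical, and none of the numbers come from the paper.

```python
def positive_recall(y_true, y_pred):
    """Recall for the positive class: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def relative_improvement(baseline, improved):
    """Relative change from baseline to improved, in percent."""
    return 100.0 * (improved - baseline) / baseline


# Illustrative labels only (1 = descriptor present, 0 = absent).
y_true = [1, 1, 1, 1, 1, 0, 0, 0]
baseline_pred = [1, 1, 0, 0, 0, 0, 0, 1]  # finds 2 of 5 positives
model_pred = [1, 1, 1, 1, 0, 0, 1, 0]     # finds 4 of 5 positives

base_recall = positive_recall(y_true, baseline_pred)
model_recall = positive_recall(y_true, model_pred)
gain = relative_improvement(base_recall, model_recall)
print(f"baseline={base_recall:.2f} model={model_recall:.2f} gain={gain:.1f}%")
```

In this toy case recall doubles, which corresponds to a 100% relative improvement; the reported 111.8% is slightly more than a doubling.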

Abstract

Mobile app marketplaces require developers to disclose standardized content rating descriptors (CRDs) to inform users about potentially sensitive or restricted content. Ensuring the accuracy and consistency of these disclosures remains challenging due to the multimodal nature of app content, which spans textual descriptions and visual interfaces. In this paper, we present QwenSafe, a Vision-Language Model (VLM) designed to automatically identify the presence of Apple-defined CRDs by jointly reasoning over app metadata and screenshots. To enable scalable training for this task, we introduce metadata2CRD, a data-construction pipeline that synthesizes descriptor-aligned question-answer pairs by combining app descriptions, screenshots, and formal descriptor definitions. We adapt Qwen3-VL-8B using supervised fine-tuning followed by Direct Preference Optimization (DPO) to align model…
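
The abstract names Direct Preference Optimization as the alignment step after supervised fine-tuning. As a rough illustration of the standard DPO objective (not the authors' training code), the sketch below computes the per-example loss from summed log-probabilities of a preferred and a dispreferred answer under the policy and a frozen reference model. The function name, the argument names, and `beta=0.1` are assumptions made for this example.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen_margin - rejected_margin)))."""
    # How much more (in log-prob) the policy favors each answer than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)); branch to avoid overflow for large |x|.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Policy identical to the reference: no preference signal yet, loss is log(2).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)

# Policy raises the chosen answer and lowers the rejected one: loss drops.
aligned = dpo_loss(-4.0, -6.0, -5.0, -5.0)

print(f"neutral={neutral:.4f} aligned={aligned:.4f}")
```

In a real fine-tuning run the log-probabilities would come from the model's token-level outputs and the loss would be averaged over a batch; the scalar version here only shows the shape of the objective.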
