ProGuard: Towards Proactive Multimodal Safeguard
Shaohan Yu, Lijun Li, Chenyang Si, Lu Sheng, Jing Shao

TL;DR
ProGuard is a vision-language model designed to proactively identify and describe out-of-distribution safety risks in multimodal generative models, using a large annotated dataset and reinforcement learning for improved safety moderation.
Contribution
The paper introduces a modality-balanced multimodal safety dataset of 87K samples and a reinforcement learning-based training method for proactively detecting and describing safety risks.
Findings
ProGuard matches large models in safety classification performance.
It significantly outperforms existing open-source guards in unsafe content categorization.
ProGuard improves OOD risk detection by 52.6% and description by 64.8%.
Abstract
The rapid evolution of generative models has led to a continuous emergence of multimodal safety risks, exposing the limitations of existing defense methods. To address these challenges, we propose ProGuard, a vision-language proactive guard that identifies and describes out-of-distribution (OOD) safety risks without the need for model adjustments required by traditional reactive approaches. We first construct a modality-balanced dataset of 87K samples, each annotated with both binary safety labels and risk categories under a hierarchical multimodal safety taxonomy, effectively mitigating modality bias and ensuring consistent moderation across text, image, and text-image inputs. Based on this dataset, we train our vision-language base model purely through reinforcement learning (RL) to achieve efficient and concise reasoning. To approximate proactive safety scenarios in a controlled…
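As a rough illustration of the annotation scheme the abstract describes (a binary safety label plus a risk category drawn from a hierarchical taxonomy, across text, image, and text-image inputs), the sketch below models one sample and checks how evenly a collection is spread over modalities. It is not taken from the ProGuard release: the `Sample` class, its field names, the `modality_shares` helper, and the placeholder category names are all hypothetical, and the choice to leave safe samples without a category is an assumption.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Tuple

# The three input types named in the abstract.
MODALITIES = ("text", "image", "text-image")


@dataclass(frozen=True)
class Sample:
    """Hypothetical record for one annotated sample (not the paper's schema)."""
    modality: str
    is_unsafe: bool
    # Path through a hierarchical taxonomy, e.g. ("top-level", "sub-category").
    # Assumption: safe samples carry an empty path.
    category_path: Tuple[str, ...] = ()

    def __post_init__(self):
        if self.modality not in MODALITIES:
            raise ValueError(f"unknown modality: {self.modality!r}")
        if self.is_unsafe and not self.category_path:
            raise ValueError("unsafe samples need a risk category")


def modality_shares(samples):
    """Return the fraction of samples in each modality."""
    counts = Counter(s.modality for s in samples)
    total = sum(counts.values())
    if total == 0:
        return {m: 0.0 for m in MODALITIES}
    return {m: counts[m] / total for m in MODALITIES}


# Placeholder data; the category names are invented for the example.
demo = [
    Sample("text", True, ("example-top-level", "example-sub-category")),
    Sample("image", False),
    Sample("text-image", True, ("example-top-level", "another-sub-category")),
]
shares = modality_shares(demo)
```

A modality-balanced dataset in this sense would show roughly equal shares for the three modalities; the exact balance criterion used by the authors is not stated in the text above.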
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling
