STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack
Xutao Mao, Liangjie Zhao, Tao Liu, Xiang Zheng, Hongying Zan, Cong Wang

TL;DR
STARE is a hierarchical reinforcement learning framework that analyzes the emergence of toxicity in multi-step image-text synthesis, enabling targeted attacks and phase-aware safety strategies.
Contribution
STARE introduces a trajectory-level view of toxicity emergence and a novel attack method that improves attack success rates and reveals the causal temporal structure of when toxicity arises.
Findings
Achieves a 68% improvement in Attack Success Rate over state-of-the-art black-box and white-box baselines.
Identifies toxicity emergence in early semantic and late refinement phases.
Targeted perturbations can suppress specific toxicity categories.
Abstract
Red-teaming Vision-Language Models is essential for identifying vulnerabilities where adversarial image-text inputs trigger toxic outputs. Existing approaches treat image generation as a black box, returning only terminal toxicity scores and leaving open the question of when and how toxic semantics emerge during multi-step synthesis. We introduce STARE, a hierarchical reinforcement learning framework that treats the denoising trajectory itself as the attack surface, under a direct white-box T2I and query-only black-box VLM setting. By coupling a high-level prompt editor with low-level T2I fine-tuning via Group Relative Policy Optimization (GRPO), STARE attains a 68% improvement in Attack Success Rate over state-of-the-art black-box and white-box baselines. More importantly, this trajectory-level view surfaces the Optimization-Induced Phase Alignment phenomenon: vanilla models exhibit…
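The abstract names Group Relative Policy Optimization (GRPO) as the fine-tuning method. As a rough illustration of the general GRPO idea only, and not of STARE's actual training code, the sketch below computes group-relative advantages: each sample's reward is normalized against the mean and standard deviation of its own sampling group, so no learned value function is needed. The function name, the use of the population standard deviation, the `eps` constant, and the example toxicity scores are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group (hypothetical helper).

    rewards: scalar rewards for a group of samples drawn for the same
             prompt, e.g. toxicity scores returned by a black-box judge.
    eps:     small constant (an assumption) that avoids division by zero
             when every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up scores for four samples generated from one prompt.
toxicity_scores = [0.10, 0.40, 0.90, 0.20]
advantages = group_relative_advantages(toxicity_scores)
```

Samples scoring above the group mean receive a positive advantage and are reinforced; those below the mean receive a negative one. How STARE assigns these rewards across denoising steps is not specified in the text above.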
