STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack

Xutao Mao; Liangjie Zhao; Tao Liu; Xiang Zheng; Hongying Zan; Cong Wang

arXiv:2605.00699 · cs.CR · May 8, 2026

TL;DR

STARE is a hierarchical reinforcement learning framework that analyzes the emergence of toxicity in multi-step image-text synthesis, enabling targeted attacks and phase-aware safety strategies.

Contribution

STARE introduces a trajectory-level view of toxicity emergence and an attack method that raises attack success rates and reveals the causal temporal structure of when toxicity emerges during synthesis.

Findings

01. Achieves a 68% improvement in Attack Success Rate over state-of-the-art black-box and white-box baselines.

02. Locates toxicity emergence in two phases of multi-step synthesis: the early semantic phase and the late refinement phase.

03. Shows that targeted perturbations can suppress specific toxicity categories.

Abstract

Red-teaming Vision-Language Models is essential for identifying vulnerabilities where adversarial image-text inputs trigger toxic outputs. Existing approaches treat image generation as a black box, returning only terminal toxicity scores and leaving open the question of when and how toxic semantics emerge during multi-step synthesis. We introduce STARE, a hierarchical reinforcement learning framework that treats the denoising trajectory itself as the attack surface, under a direct white-box T2I and query-only black-box VLM setting. By coupling a high-level prompt editor with low-level T2I fine-tuning via Group Relative Policy Optimization (GRPO), STARE attains a 68% improvement in Attack Success Rate over state-of-the-art black-box and white-box baselines. More importantly, this trajectory-level view surfaces the Optimization-Induced Phase Alignment phenomenon: vanilla models exhibit…
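The abstract names Group Relative Policy Optimization (GRPO) as the optimizer for the low-level T2I fine-tuning. As a rough illustration only, the sketch below shows the group-relative advantage step commonly associated with GRPO: several samples are drawn for one prompt, and each reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` constant, and the example scores are hypothetical and are not taken from the paper; how STARE actually defines its reward is not stated in this excerpt.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Assumption: this is the generic GRPO-style normalization,
    (r_i - mean(r)) / (std(r) + eps), not STARE's exact recipe.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical toxicity scores for four images sampled from one prompt.
scores = [0.1, 0.4, 0.4, 0.9]
advantages = group_relative_advantages(scores)
```

Samples that score above their group's mean receive a positive advantage and are reinforced; those below receive a negative one. Because the baseline is the group itself, this style of update needs no separately learned value function.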
