GuardTrace-VL: Detecting Unsafe Multimodal Reasoning via Iterative Safety Supervision
Yuxiao Xiang, Junchi Chen, Zhenchao Jin, Changtao Miao, Haojie Yuan, Qi Chu, Tao Gong, Nenghai Yu

TL;DR
GuardTrace-VL is a vision-aware safety auditor for multimodal reasoning models. It detects unsafe content in the intermediate reasoning trace rather than only in the final answer, and reports a 93.1% F1 score on unsafe reasoning detection.
Contribution
The paper introduces GuardTrace-VL, a novel joint image-text safety auditor with a new dataset and training scheme, enhancing detection of unsafe reasoning in multimodal models.
Findings
Achieves 93.1% F1 score on unsafe reasoning detection
Outperforms previous safety methods by 13.5% F1 score
Effective in both in-domain and out-of-domain scenarios
Abstract
Multimodal large reasoning models (MLRMs) are increasingly deployed for vision-language tasks that produce explicit intermediate rationales. However, reasoning traces can contain unsafe content even when the final answer is non-harmful, creating deployment risks. Existing multimodal safety guards primarily evaluate only the input question and the final answer, neglecting the intermediate reasoning process. This oversight allows undetected harm, such as biased inferences or policy-violating use of visual context, to emerge during reasoning. We introduce GuardTrace-VL, a vision-aware safety auditor that monitors the full Question-Thinking-Answer (QTA) pipeline via joint image-text analysis, enabling detection of unsafe content as it emerges in the reasoning stage. To support training and evaluation, we construct the GuardTrace dataset, which is generated through diverse prompting…
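The gap the abstract describes, a guard that reads only the question and answer versus one that also reads the reasoning trace, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `QTARecord` type, the `image_path` field, the marker list, and the keyword-based `toy_is_unsafe` scorer are hypothetical stand-ins, not the GuardTrace-VL model, its dataset format, or its API. The real auditor analyzes image and text jointly, while this toy version looks at text only.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class QTARecord:
    """One sample to audit: question, reasoning trace, final answer."""
    question: str
    thinking: str
    answer: str
    image_path: Optional[str] = None  # hypothetical field; ignored by the toy scorer


# Hypothetical marker list standing in for a learned safety classifier.
UNSAFE_MARKERS = ("bypass the lock",)


def toy_is_unsafe(text: str) -> bool:
    """Toy scorer: flags text containing any marker (case-insensitive)."""
    lowered = text.lower()
    return any(marker in lowered for marker in UNSAFE_MARKERS)


def audit_qa_only(rec: QTARecord,
                  is_unsafe: Callable[[str], bool] = toy_is_unsafe) -> bool:
    """Checks only the question and the final answer."""
    return is_unsafe(rec.question) or is_unsafe(rec.answer)


def audit_qta(rec: QTARecord,
              is_unsafe: Callable[[str], bool] = toy_is_unsafe) -> bool:
    """Checks question, thinking, and answer."""
    return any(is_unsafe(part)
               for part in (rec.question, rec.thinking, rec.answer))


# Made-up sample: the answer is harmless, the reasoning trace is not.
record = QTARecord(
    question="What is shown in this picture of a door?",
    thinking="The image shows a keypad. One could bypass the lock, "
             "but I should not describe that.",
    answer="The picture shows a door with an electronic keypad.",
)

print("QA-only flags sample:", audit_qa_only(record))
print("QTA flags sample:", audit_qta(record))
```

On this made-up sample the QA-only check passes the record while the QTA check flags it, which is the failure mode the abstract attributes to guards that skip the reasoning stage.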
Taxonomy
Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Topic Modeling
