Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security

Muzhi Dai; Shixuan Liu; Zhiyuan Zhao; Junyu Gao; Hao Sun; Xuelong Li

arXiv:2507.22037·cs.CR·July 30, 2025

TL;DR

SecTOW iteratively trains a defender and an attacker with reinforcement learning to harden multimodal models against jailbreak inputs. The attacker exposes vulnerabilities and expands the available jailbreak data, which improves safety without sacrificing general performance.

Contribution

The paper presents an iterative defense-attack training framework based on reinforcement learning that addresses two problems in multimodal model security: intrinsic model vulnerabilities and the scarcity of jailbreak data.

Findings

01. Significantly improves model security against jailbreak inputs.

02. Maintains high performance on safety and general benchmarks.

03. Efficiently expands jailbreak data with synthetic responses.

Abstract

The rapid advancement of multimodal large language models (MLLMs) has led to breakthroughs in various applications, yet their security remains a critical challenge. One pressing issue involves unsafe image-query pairs--jailbreak inputs specifically designed to bypass security constraints and elicit unintended responses from MLLMs. Compared to general multimodal data, such unsafe inputs are relatively sparse, which limits the diversity and richness of training samples available for developing robust defense models. Meanwhile, existing guardrail-type methods rely on external modules to enforce security constraints but fail to address intrinsic vulnerabilities within MLLMs. Traditional supervised fine-tuning (SFT), on the other hand, often over-refuses harmless inputs, compromising general performance. Given these challenges, we propose Secure Tug-of-War (SecTOW), an innovative iterative…
