AM$^3$Safety: Towards Data Efficient Alignment of Multi-modal Multi-turn Safety for MLLMs

Han Zhu; Jiale Chen; Chengkun Cai; Shengjie Sun; Haoran Li; Yujin Zhou; Chi-Min Chan; Pengcheng Wen; Lei Li; Sirui Han; Yike Guo

arXiv:2601.04736·cs.CL·January 9, 2026

Open Access

TL;DR

This paper introduces AM$^3$Safety, a framework for improving multi-modal, multi-turn safety in multi-modal large language models (MLLMs). It combines a new dialogue dataset with a specialized training approach, reducing harmful outputs while maintaining helpfulness.

Contribution

The paper presents InterSafe-V, a new multi-modal dialogue dataset, and a fine-tuning method that improves MLLM safety across multi-turn interactions. Together they address a limitation of existing RLHF approaches, which are largely developed for single-turn tasks.

Findings

01 Over 10% reduction in Attack Success Rate (ASR)

02 At least 8% increase in harmless responses

03 Over 13% improvement in helpful responses
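
For readers unfamiliar with the metric, Attack Success Rate is commonly computed as the fraction of adversarial dialogues that a judge marks as having elicited a harmful response. The sketch below illustrates that common definition only; it is not the paper's evaluation code. The function name and the judgment lists are hypothetical, and this page does not state whether the reported percentages are absolute or relative changes, so both are computed.

```python
from typing import Sequence


def attack_success_rate(harmful_flags: Sequence[bool]) -> float:
    """Fraction of adversarial dialogues judged to have elicited a harmful reply."""
    if not harmful_flags:
        raise ValueError("need at least one judged dialogue")
    return sum(harmful_flags) / len(harmful_flags)


# Hypothetical judge outputs (True = the attack succeeded). Made-up values
# for illustration only, not results from the paper.
baseline = [True, True, False, True, False, False, True, False, False, False]
aligned = [True, False, False, True, False, False, False, False, False, False]

baseline_asr = attack_success_rate(baseline)  # 4 of 10 attacks succeed
aligned_asr = attack_success_rate(aligned)    # 2 of 10 attacks succeed

# Two ways to express the improvement; they can differ substantially.
absolute_drop = baseline_asr - aligned_asr      # change in percentage points
relative_drop = absolute_drop / baseline_asr    # change relative to baseline
```

With these made-up judgments the same improvement reads as 20 percentage points in absolute terms and 50% in relative terms, which is why it matters which convention a reported ASR reduction uses.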

Abstract

Multi-modal Large Language Models (MLLMs) are increasingly deployed in interactive applications. However, their safety vulnerabilities become pronounced in multi-turn multi-modal scenarios, where harmful intent can be gradually reconstructed across turns, and security protocols fade into oblivion as the conversation progresses. Existing Reinforcement Learning from Human Feedback (RLHF) alignment methods are largely developed for single-turn visual question-answering (VQA) tasks and often require costly manual preference annotations, limiting their effectiveness and scalability in dialogues. To address this challenge, we present InterSafe-V, an open-source multi-modal dialogue dataset containing 11,270 dialogues and 500 specially designed refusal VQA samples. This dataset, constructed through interaction between several models, is designed to more accurately reflect real-world scenarios and…

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Hate Speech and Cyberbullying Detection