OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs
Xin Wang, Yunhao Chen, Juncheng Li, Yixu Wang, Yang Yao, Tianle Gu, Jie Li, Yan Teng, Yingchun Wang, Xia Hu

TL;DR
OpenRT is a comprehensive, modular framework for systematically evaluating the safety vulnerabilities of multimodal large language models through diverse attack strategies and extensive empirical testing.
Contribution
We introduce OpenRT, a scalable, modular red-teaming framework that standardizes attack interfaces and integrates 37 attack methods for thorough MLLM safety evaluation.
Findings
Frontier models show safety gaps, with attack success rates of up to 49.14%.
Reasoning models do not inherently have better robustness against multi-turn jailbreaks.
Models fail to generalize safety defenses across different attack paradigms.
Abstract
The rapid integration of Multimodal Large Language Models (MLLMs) into critical applications is increasingly hindered by persistent safety vulnerabilities. However, existing red-teaming benchmarks are often fragmented, limited to single-turn text interactions, and lack the scalability required for systematic evaluation. To address this, we introduce OpenRT, a unified, modular, and high-throughput red-teaming framework designed for comprehensive MLLM safety evaluation. At its core, OpenRT architects a paradigm shift in automated red-teaming by introducing an adversarial kernel that enables modular separation across five critical dimensions: model integration, dataset management, attack strategies, judging methods, and evaluation metrics. By standardizing attack interfaces, it decouples adversarial logic from a high-throughput asynchronous runtime, enabling systematic scaling across…
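The separation the abstract describes (models, datasets, attacks, judges, and metrics as independent pieces, with attack logic decoupled from an asynchronous runtime) can be illustrated with a minimal sketch. Every name below (`Attack`, `PrefixAttack`, `stub_model`, `keyword_judge`, `evaluate`, `run_demo`) is hypothetical and chosen for illustration only; none of it is OpenRT's actual API, and the stub model and toy attack stand in for real components.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List, Protocol

# NOTE: hypothetical names throughout; this is not OpenRT's API.
Model = Callable[[str], Awaitable[str]]  # model integration: prompt -> response
Judge = Callable[[str, str], bool]       # judging: (goal, response) -> attack succeeded?


class Attack(Protocol):
    """Standardized attack interface: the runtime only ever calls run()."""

    async def run(self, goal: str, model: Model) -> str: ...


@dataclass
class PrefixAttack:
    """Toy single-turn attack that prepends a fixed string to the goal."""

    prefix: str

    async def run(self, goal: str, model: Model) -> str:
        return await model(f"{self.prefix} {goal}")


async def stub_model(prompt: str) -> str:
    """Stand-in target model: 'complies' only if the prompt starts with 'please'."""
    await asyncio.sleep(0)  # yield to the event loop, as a real API call would
    return "COMPLY" if prompt.lower().startswith("please") else "REFUSE"


def keyword_judge(goal: str, response: str) -> bool:
    """Toy judge: counts the attack as successful when the model complied."""
    return response == "COMPLY"


def attack_success_rate(outcomes: List[bool]) -> float:
    """Evaluation metric: percentage of goals for which the attack succeeded."""
    return 100.0 * sum(outcomes) / len(outcomes) if outcomes else 0.0


async def evaluate(attack: Attack, model: Model, judge: Judge,
                   goals: List[str], concurrency: int = 4) -> float:
    """Runtime: runs one attack over a dataset of goals with bounded concurrency."""
    sem = asyncio.Semaphore(concurrency)

    async def one(goal: str) -> bool:
        async with sem:
            response = await attack.run(goal, model)
        return judge(goal, response)

    outcomes = await asyncio.gather(*(one(g) for g in goals))
    return attack_success_rate(list(outcomes))


def run_demo():
    """Evaluates two toy attacks against the stub model on placeholder goals."""
    goals = ["goal-a", "goal-b", "goal-c", "goal-d"]  # placeholder dataset
    polite = asyncio.run(evaluate(PrefixAttack("Please"), stub_model, keyword_judge, goals))
    blunt = asyncio.run(evaluate(PrefixAttack("Now"), stub_model, keyword_judge, goals))
    return polite, blunt
```

Because `evaluate` depends only on the `run` method, a new attack can be swapped in without touching the runtime, the judge, or the metric. That is the kind of decoupling the abstract attributes to standardized attack interfaces, shown here under simplified assumptions.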
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection
