MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning
Muyang Zheng, Yuanzhi Yao, Changting Lin, Caihong Kai, Yanxiang Chen, Zhiquan Liu

TL;DR
This paper introduces MIST, an iterative semantic tuning method that effectively jailbreaks black-box large language models by refining prompts to induce harmful responses with minimal queries.
Contribution
MIST combines sequential synonym search with an advanced variant, order-determining optimization, to refine prompts that preserve the original semantic intent while bypassing model alignment and safety measures.
Findings
MIST achieves high attack success rates on multiple models.
It requires fewer queries than existing methods.
MIST demonstrates good transferability and computational efficiency.
Abstract
Despite efforts to align large language models (LLMs) with societal and moral values, these models remain susceptible to jailbreak attacks -- methods designed to elicit harmful responses. Jailbreaking black-box LLMs is considered challenging due to the discrete nature of token inputs, restricted access to the target LLM, and limited query budget. To address the issues above, we propose an effective method for jailbreaking black-box large language Models via Iterative Semantic Tuning, named MIST. MIST enables attackers to iteratively refine prompts that preserve the original semantic intent while inducing harmful content. Specifically, to balance semantic similarity with computational efficiency, MIST incorporates two key strategies: sequential synonym search, and its advanced version -- order-determining optimization. We conduct extensive experiments on two datasets using two…
Taxonomy
Topics: Digital and Cyber Forensics
