Jailbreaking Attacks vs. Content Safety Filters: How Far Are We in the LLM Safety Arms Race?

Yuan Xin; Dingfan Chen; Linyi Yang; Michael Backes; Xiao Zhang

arXiv:2512.24044·cs.CR·January 1, 2026

Open Access

TL;DR

This paper systematically evaluates jailbreak attacks against the full LLM deployment pipeline, including content safety filters. It finds that safety filters can detect most attacks but still need a better balance between recall and precision.

Contribution

The paper presents the first systematic evaluation of jailbreak effectiveness across the full inference pipeline, including input and output safety filters, and highlights both what the filters detect and where they need to improve.

Findings

01 Most jailbreaks can be detected by safety filters

02 Safety filters effectively identify adversarial prompts

03 Room for improvement in detection accuracy and usability
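The recall and precision trade-off mentioned in the findings can be made concrete with a small calculation. The following is a minimal sketch, not the paper's evaluation code: the function name `precision_recall` and the toy labels are assumptions made for illustration only.

```python
from typing import Sequence, Tuple


def precision_recall(flagged: Sequence[bool], harmful: Sequence[bool]) -> Tuple[float, float]:
    """Compute precision and recall for a safety filter.

    flagged[i] is True when the (hypothetical) filter flagged prompt i.
    harmful[i] is True when prompt i is actually harmful (ground truth).
    """
    if len(flagged) != len(harmful):
        raise ValueError("flagged and harmful must have the same length")

    true_pos = sum(1 for f, h in zip(flagged, harmful) if f and h)
    false_pos = sum(1 for f, h in zip(flagged, harmful) if f and not h)
    false_neg = sum(1 for f, h in zip(flagged, harmful) if not f and h)

    # Precision: of everything flagged, how much was truly harmful.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    # Recall: of everything truly harmful, how much was flagged.
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Toy data (made up): the filter catches both harmful prompts
# but also flags one benign prompt.
toy_flagged = [True, True, True, False]
toy_harmful = [True, True, False, False]
toy_precision, toy_recall = precision_recall(toy_flagged, toy_harmful)
```

In this toy case the filter has perfect recall but flags a benign prompt, which lowers precision. That is the kind of usability cost the third finding points to.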

Abstract

As large language models (LLMs) are increasingly deployed, ensuring their safe use is paramount. Jailbreaks, adversarial prompts that bypass model alignment to trigger harmful outputs, present significant risks, with existing studies reporting high success rates in evading common LLMs. However, previous evaluations have focused solely on the models, neglecting the full deployment pipeline, which typically incorporates additional safety mechanisms like content moderation filters. To address this gap, we present the first systematic evaluation of jailbreak attacks targeting LLM safety alignment, assessing their success across the full inference pipeline, including both input and output filtering stages. Our findings yield two key insights: first, nearly all evaluated jailbreak techniques can be detected by at least one safety filter, suggesting that prior assessments may have…
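The full-pipeline setting described in the abstract (input filter, then model, then output filter) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: `run_pipeline`, `end_to_end_success_rate`, and the callables passed to them are hypothetical stand-ins for real filters, a real LLM, and a real harmfulness judge.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Hypothetical interfaces: a filter returns True when it flags the text.
Filter = Callable[[str], bool]
Model = Callable[[str], str]


@dataclass
class PipelineResult:
    blocked_at: Optional[str]  # "input", "output", or None if nothing blocked
    response: Optional[str]    # model response, only when nothing blocked


def run_pipeline(prompt: str, input_filter: Filter, model: Model,
                 output_filter: Filter) -> PipelineResult:
    """Pass one prompt through input filter -> model -> output filter."""
    if input_filter(prompt):
        return PipelineResult(blocked_at="input", response=None)
    response = model(prompt)
    if output_filter(response):
        return PipelineResult(blocked_at="output", response=None)
    return PipelineResult(blocked_at=None, response=response)


def end_to_end_success_rate(prompts: Sequence[str], input_filter: Filter,
                            model: Model, output_filter: Filter,
                            is_harmful: Callable[[str], bool]) -> float:
    """Fraction of prompts that evade both filters AND yield harmful output.

    is_harmful is an assumed judge function; a model-only evaluation would
    skip the two filters and count harmful responses directly.
    """
    if not prompts:
        return 0.0
    successes = 0
    for prompt in prompts:
        result = run_pipeline(prompt, input_filter, model, output_filter)
        if result.blocked_at is None and is_harmful(result.response):
            successes += 1
    return successes / len(prompts)
```

Counting success only when a prompt passes both filters and still produces harmful output is what separates a pipeline-level measurement from a model-only one. An attack that works against the bare model can still fail here if either filter flags it.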

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling