Detecting Jailbreak Attempts in Clinical Training LLMs Through Automated Linguistic Feature Extraction
Tri Nguyen, Huy Hoang Bao Le, Lohith Srikanth Pentapalli, Laurah Turner, Kelly Cohen

TL;DR
This paper presents an automated, scalable method for detecting jailbreak attempts in clinical training LLMs by extracting linguistic features using BERT models, improving safety in clinical dialogue systems.
Contribution
It introduces an automated framework that uses expert-annotated linguistic features and BERT-based models to detect jailbreak attempts, enhancing scalability and interpretability.
Findings
High overall performance in jailbreak detection across evaluations
Effective use of BERT-based models for predicting linguistic features
Identification of key limitations and future directions in annotation and feature extraction
Abstract
Detecting jailbreak attempts in clinical training large language models (LLMs) requires accurate modeling of linguistic deviations that signal unsafe or off-task user behavior. Prior work on the 2-Sigma clinical simulation platform showed that manually annotated linguistic features could support jailbreak detection. However, reliance on manual annotation limited both scalability and expressiveness. In this study, we extend this framework by using experts' annotations of four core linguistic features (Professionalism, Medical Relevance, Ethical Behavior, and Contextual Distraction) and training multiple general-domain and medical-domain BERT-based LLM models to predict these features directly from text. The most reliable feature regressor for each dimension was selected and used as the feature extractor in a second layer of classifiers. We evaluate a suite of predictive models, including…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsArtificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills · Topic Modeling
