Detecting Jailbreak Attempts in Clinical Training LLMs Through Automated Linguistic Feature Extraction

Tri Nguyen; Huy Hoang Bao Le; Lohith Srikanth Pentapalli; Laurah Turner; Kelly Cohen

arXiv:2602.13321·cs.AI·February 17, 2026

Detecting Jailbreak Attempts in Clinical Training LLMs Through Automated Linguistic Feature Extraction

Tri Nguyen, Huy Hoang Bao Le, Lohith Srikanth Pentapalli, Laurah Turner, Kelly Cohen

PDF

Open Access

TL;DR

This paper presents an automated, scalable method for detecting jailbreak attempts in clinical training LLMs by extracting linguistic features using BERT models, improving safety in clinical dialogue systems.

Contribution

It introduces an automated framework that uses expert-annotated linguistic features and BERT-based models to detect jailbreak attempts, enhancing scalability and interpretability.

Findings

01

High overall performance in jailbreak detection across evaluations

02

Effective use of BERT-based models for predicting linguistic features

03

Identification of key limitations and future directions in annotation and feature extraction

Abstract

Detecting jailbreak attempts in clinical training large language models (LLMs) requires accurate modeling of linguistic deviations that signal unsafe or off-task user behavior. Prior work on the 2-Sigma clinical simulation platform showed that manually annotated linguistic features could support jailbreak detection. However, reliance on manual annotation limited both scalability and expressiveness. In this study, we extend this framework by using experts' annotations of four core linguistic features (Professionalism, Medical Relevance, Ethical Behavior, and Contextual Distraction) and training multiple general-domain and medical-domain BERT-based LLM models to predict these features directly from text. The most reliable feature regressor for each dimension was selected and used as the feature extractor in a second layer of classifiers. We evaluate a suite of predictive models, including…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsArtificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills · Topic Modeling