Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education

Xin Yi; Yue Li; Dongsheng Shi; Linlin Wang; Xiaoling Wang; Liang He

arXiv:2511.14423·cs.CL·November 19, 2025

Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education

Xin Yi, Yue Li, Dongsheng Shi, Linlin Wang, Xiaoling Wang, Liang He

PDF

Open Access

TL;DR

This paper introduces EduHarm, a benchmark for educational LLM safety, and proposes TSSF, a three-stage framework that effectively defends against jailbreak and fine-tuning attacks while maintaining utility.

Contribution

The paper presents a novel safety evaluation benchmark EduHarm and a comprehensive three-stage defense framework TSSF tailored for educational LLMs, addressing unique safety challenges.

Findings

01

TSSF significantly improves safety against jailbreak attacks.

02

TSSF maintains high utility for benign queries.

03

Effective defense demonstrated across multiple attack strategies.

Abstract

Large Language Models (LLMs) are increasingly integrated into educational applications. However, they remain vulnerable to jailbreak and fine-tuning attacks, which can compromise safety alignment and lead to harmful outputs. Existing studies mainly focus on general safety evaluations, with limited attention to the unique safety requirements of educational scenarios. To address this gap, we construct EduHarm, a benchmark containing safe-unsafe instruction pairs across five representative educational scenarios, enabling systematic safety evaluation of educational LLMs. Furthermore, we propose a three-stage shield framework (TSSF) for educational LLMs that simultaneously mitigates both jailbreak and fine-tuning attacks. First, safety-aware attention realignment redirects attention toward critical unsafe tokens, thereby restoring the harmfulness feature that discriminates between unsafe and…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)