Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education
Xin Yi, Yue Li, Dongsheng Shi, Linlin Wang, Xiaoling Wang, Liang He

TL;DR
This paper introduces EduHarm, a benchmark for educational LLM safety, and proposes TSSF, a three-stage framework that effectively defends against jailbreak and fine-tuning attacks while maintaining utility.
Contribution
The paper presents a novel safety evaluation benchmark EduHarm and a comprehensive three-stage defense framework TSSF tailored for educational LLMs, addressing unique safety challenges.
Findings
TSSF significantly improves safety against jailbreak attacks.
TSSF maintains high utility for benign queries.
Effective defense demonstrated across multiple attack strategies.
Abstract
Large Language Models (LLMs) are increasingly integrated into educational applications. However, they remain vulnerable to jailbreak and fine-tuning attacks, which can compromise safety alignment and lead to harmful outputs. Existing studies mainly focus on general safety evaluations, with limited attention to the unique safety requirements of educational scenarios. To address this gap, we construct EduHarm, a benchmark containing safe-unsafe instruction pairs across five representative educational scenarios, enabling systematic safety evaluation of educational LLMs. Furthermore, we propose a three-stage shield framework (TSSF) for educational LLMs that simultaneously mitigates both jailbreak and fine-tuning attacks. First, safety-aware attention realignment redirects attention toward critical unsafe tokens, thereby restoring the harmfulness feature that discriminates between unsafe and…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)
