Evaluating Prompt Injection Defenses for Educational LLM Tutors: Security-Usability-Latency Trade-offs
Alexandre Cristov\~ao Maiorano

TL;DR
This paper evaluates prompt-injection defenses for educational language models, analyzing security, usability, and latency trade-offs to guide guardrail selection in AI tutoring systems.
Contribution
It introduces a comprehensive evaluation methodology and benchmark protocol for comparing prompt-injection defenses in educational LLM tutors.
Findings
The multi-layer safeguard pipeline achieves low bypass and false positive rates with optimized latency.
NeMo Guardrails reach 0% bypass at 16.22% FPR and ~1.5s latency.
Prompt Guard yields 38.48% bypass with 3.60% FPR.
Abstract
Educational LLM tutors face a core AI alignment challenge: they must follow user intent while preserving pedagogical constraints and safety policies. We present an evaluation methodology for prompt-injection defenses in this setting, showing that guardrail design entails explicit trade-offs among adversarial robustness, benign-task usability, and response latency. We evaluate a domain-specific multi-layer safeguard pipeline combining deterministic pattern filters, structural validation, contextual sandboxing, and session-level behavioral checks. On a controlled holdout benchmark, the pipeline reaches low bypass and false positive rates with optimized average latency - an operating point that prioritizes pedagogical usability (zero false positives) while maintaining measurable attack resistance. We provide a reproducible benchmark protocol for head-to-head comparison under identical…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
