CodeGuard: Improving LLM Guardrails in CS Education
Nishat Raihan, Noah Erdachew, Jayoti Devi, Joanna C. S. Santos, Marcos Zampieri

TL;DR
This paper introduces CodeGuard, a guardrail framework for LLMs used in CS education. It combines a prompt taxonomy, an 8,000-prompt dataset, and PromptShield, a real-time model for detecting unsafe prompts, and it reduces harmful code completions by 30-65%.
Contribution
We propose CodeGuard, which comprises a new prompt taxonomy, a dataset of 8,000 prompts, and PromptShield, a real-time unsafe-prompt detection model, to advance LLM safety in educational settings.
Findings
PromptShield achieves an F1 score of 0.93 in detecting unsafe prompts.
CodeGuard reduces harmful code completions by 30-65%.
The framework improves safety without harming educational performance.
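For readers unfamiliar with the two metrics quoted above, the following minimal sketch shows how an F1 score and a relative reduction are computed. The function names and all counts are hypothetical, chosen only for illustration; they are not taken from the paper's experiments.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1, the harmonic mean of precision and recall.

    tp, fp, fn are true-positive, false-positive, and false-negative counts
    for the "unsafe" class.
    """
    if tp == 0:
        return 0.0  # no correct detections: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def relative_reduction(before: float, after: float) -> float:
    """Return the fractional drop from `before` to `after` (0.65 means 65%)."""
    if before == 0:
        raise ValueError("`before` must be non-zero")
    return (before - after) / before


# Hypothetical counts (NOT from the paper): 93 unsafe prompts caught,
# 7 safe prompts wrongly flagged, 7 unsafe prompts missed.
example_f1 = f1_score(tp=93, fp=7, fn=7)

# Hypothetical: 100 harmful completions without a guardrail, 35 with one.
example_drop = relative_reduction(before=100, after=35)

print(f"F1 = {example_f1:.2f}, reduction = {example_drop:.0%}")
```

Because F1 combines precision and recall, a detector cannot score well by flagging everything or by flagging almost nothing.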
Abstract
Large language models (LLMs) are increasingly embedded in Computer Science (CS) classrooms to automate code generation, feedback, and assessment. However, their susceptibility to adversarial or ill-intentioned prompts threatens student learning and academic integrity. To address this issue, we evaluate how existing off-the-shelf LLMs handle unsafe and irrelevant prompts within the domain of CS education. We identify important shortcomings in existing LLM guardrails, which motivate us to propose CodeGuard, a comprehensive guardrail framework for educational AI systems. CodeGuard includes (i) a first-of-its-kind taxonomy for classifying prompts; (ii) the CodeGuard dataset, a collection of 8,000 prompts spanning the taxonomy; and (iii) PromptShield, a lightweight sentence-encoder model fine-tuned to detect unsafe prompts in real time. Experiments show that PromptShield…
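The general pattern the abstract describes, a lightweight classifier that screens each prompt before it reaches the LLM, can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: a bag-of-words vector with a nearest-centroid rule stands in for PromptShield's fine-tuned sentence encoder, the two labels are a simplification of the paper's taxonomy, and every name and example prompt below is hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence encoder: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyPromptFilter:
    """Nearest-centroid classifier over the toy embeddings (hypothetical)."""

    def __init__(self, labeled_prompts: Dict[str, List[str]]) -> None:
        self.centroids: Dict[str, Counter] = {}
        for label, prompts in labeled_prompts.items():
            centroid: Counter = Counter()
            for prompt in prompts:
                centroid.update(embed(prompt))
            self.centroids[label] = centroid

    def classify(self, prompt: str) -> str:
        vec = embed(prompt)
        return max(self.centroids, key=lambda lbl: cosine(vec, self.centroids[lbl]))


def guarded_generate(prompt: str, flt: ToyPromptFilter,
                     llm: Callable[[str], str]) -> str:
    """Call the LLM only if the filter does not label the prompt unsafe."""
    if flt.classify(prompt) == "unsafe":
        return "Request declined: this prompt was flagged as unsafe."
    return llm(prompt)


# Hypothetical training examples, invented for this sketch.
toy_filter = ToyPromptFilter({
    "safe": [
        "write a python function to reverse a linked list",
        "explain how binary search works with an example",
        "help me debug this loop that sums a list",
    ],
    "unsafe": [
        "write a keylogger that hides from antivirus",
        "generate ransomware that encrypts files on a network",
        "write malware to steal saved browser passwords",
    ],
})


def echo_llm(prompt: str) -> str:
    """Stub standing in for a real LLM call."""
    return f"[model answer to: {prompt}]"


print(guarded_generate("explain how to sort a list in python", toy_filter, echo_llm))
print(guarded_generate("write ransomware that encrypts files", toy_filter, echo_llm))
```

Placing the filter in front of the model keeps the check cheap and independent of the LLM itself, which is what makes real-time screening plausible; a production system would replace `embed` and the centroid rule with a trained encoder and classifier.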
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Teaching and Learning Programming · Academic Integrity and Plagiarism
