Lightweight Safety Classification Using Pruned Language Models
Mason Sawtell, Tula Masterman, Sandi Besen, Jim Brown

TL;DR
This paper presents an efficient safety classification method that trains a simple classifier on the intermediate transformer layers of pruned language models. It outperforms specialized models while needing fewer training examples, and it applies across transformer architectures.
Contribution
Introduces Layer Enhanced Classification (LEC), a technique combining a simple classifier with intermediate transformer layers for effective safety and prompt injection detection.
Findings
Small general-purpose models (Qwen 2.5 at 0.5B, 1.5B, and 3B) and other transformer-based architectures such as DeBERTa v3 are effective feature extractors.
Intermediate layers typically outperform final layers in classification tasks.
Pruned models can serve as robust feature extractors for safety classification.
Abstract
In this paper, we introduce a novel technique for content safety and prompt injection classification for Large Language Models. Our technique, Layer Enhanced Classification (LEC), trains a Penalized Logistic Regression (PLR) classifier on the hidden state of an LLM's optimal intermediate transformer layer. By combining the computational efficiency of a streamlined PLR classifier with the sophisticated language understanding of an LLM, our approach delivers superior performance surpassing GPT-4o and special-purpose models fine-tuned for each task. We find that small general-purpose models (Qwen 2.5 sizes 0.5B, 1.5B, and 3B) and other transformer-based architectures like DeBERTa v3 are robust feature extractors allowing simple classifiers to be effectively trained on fewer than 100 high-quality examples. Importantly, the intermediate transformer layers of these models typically outperform…
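The core idea in the abstract (fit a penalized logistic regression on one layer's hidden state, and pick the layer that works best) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and not the authors' implementation: synthetic random arrays stand in for real LLM hidden states, the penalty is taken to be L2, the fit uses plain gradient descent, the layer is chosen by validation accuracy, and the names `fit_plr`, `accuracy`, `select_layer`, and `make_hidden_states` are hypothetical.

```python
import numpy as np

def fit_plr(X, y, l2=1.0, lr=0.1, steps=500):
    """Fit an L2-penalized logistic regression with gradient descent.
    The L2 penalty and the optimizer are assumptions of this sketch."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(unsafe)
        w -= lr * (X.T @ (p - y) / n + l2 * w / n)  # penalized weight gradient
        b -= lr * np.mean(p - y)                  # bias is left unpenalized
    return w, b

def accuracy(X, y, w, b):
    """Fraction of examples whose predicted label matches y."""
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))

def select_layer(H_train, y_train, H_val, y_val):
    """Train one classifier per layer and return (best_layer, val_scores).
    H_* has shape (n_examples, n_layers, hidden_dim)."""
    scores = []
    for layer in range(H_train.shape[1]):
        w, b = fit_plr(H_train[:, layer, :], y_train)
        scores.append(accuracy(H_val[:, layer, :], y_val, w, b))
    return int(np.argmax(scores)), scores

def make_hidden_states(rng, n, strengths, dim=16):
    """Synthetic stand-in for hidden states: Gaussian noise plus a
    label-dependent shift whose size differs per layer."""
    y = rng.integers(0, 2, size=n).astype(float)
    direction = np.ones(dim) / np.sqrt(dim)
    H = rng.normal(size=(n, len(strengths), dim))
    for layer, s in enumerate(strengths):
        H[:, layer, :] += s * (2 * y - 1)[:, None] * direction
    return H, y

rng = np.random.default_rng(0)
# Assumed signal profile: strongest at an intermediate layer (index 3).
strengths = [0.0, 0.3, 0.8, 3.0, 0.8, 0.3]
H_train, y_train = make_hidden_states(rng, 80, strengths)   # under 100 examples
H_val, y_val = make_hidden_states(rng, 200, strengths)

best_layer, scores = select_layer(H_train, y_train, H_val, y_val)
print("best layer:", best_layer)
print("validation accuracy per layer:", [round(s, 3) for s in scores])
```

With a real model, the synthetic arrays would be replaced by per-layer hidden states for each input text (for example, Hugging Face `transformers` models return them when called with `output_hidden_states=True`). How a token sequence is reduced to one vector per layer is not specified in the text above, so that step is left out of the sketch.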
Taxonomy
Topics: Software Reliability and Analysis Research · Web Application Security Vulnerabilities · Software Testing and Debugging Techniques
Methods: DeBERTa · Logistic Regression
