Efficient LLM Moderation with Multi-Layer Latent Prototypes
Maciej Chrabąszcz, Filip Szatkowski, Bartosz Wójcik, Jan Dubiński, Tomasz Trzciński, Sebastian Cygert

TL;DR
The paper introduces the Multi-Layer Prototype Moderator (MLPM), a lightweight input moderation tool that uses prototypes of intermediate representations from multiple LLM layers. The authors report state-of-the-art results on diverse moderation benchmarks with negligible overhead on the generation pipeline.
Contribution
We propose a novel multi-layer prototype approach for LLM moderation that is highly customizable, efficient, and easily integrable into existing pipelines.
Findings
Achieves state-of-the-art moderation performance
Maintains high efficiency with negligible overhead
Scales effectively across different model sizes
Abstract
Although modern LLMs are aligned with human values during post-training, robust moderation remains essential to prevent harmful outputs at deployment time. Existing approaches suffer from performance-efficiency trade-offs and are difficult to customize to user-specific requirements. Motivated by this gap, we introduce Multi-Layer Prototype Moderator (MLPM), a lightweight and highly customizable input moderation tool. We propose leveraging prototypes of intermediate representations across multiple layers to improve moderation quality while maintaining high efficiency. By design, our method adds negligible overhead to the generation pipeline and can be seamlessly applied to any model. MLPM achieves state-of-the-art performance on diverse moderation benchmarks and demonstrates strong scalability across model families of various sizes. Moreover, we show that it integrates smoothly into…
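The abstract describes the idea only at a high level, so the following is a minimal sketch of one way prototype-based moderation over several layers could look, not the authors' implementation. It assumes that a prototype is the mean representation of a class at a given layer, that similarity is measured by Euclidean distance, and that layers are combined by an unweighted average; none of these choices is confirmed by the abstract. All names (`fit_prototypes`, `harmful_score`, `is_flagged`) and the toy vectors are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Sequence

Vector = Sequence[float]
LayerFeatures = Sequence[Vector]  # one representation vector per selected layer

SAFE, HARMFUL = 0, 1


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def fit_prototypes(
    examples: Sequence[LayerFeatures], labels: Sequence[int]
) -> Dict[int, List[List[float]]]:
    """Build one prototype per class and per layer (assumption: the class mean)."""
    prototypes: Dict[int, List[List[float]]] = {}
    for label in set(labels):
        members = [ex for ex, y in zip(examples, labels) if y == label]
        # zip(*members) groups the vectors of all members layer by layer.
        prototypes[label] = [mean_vector(layer) for layer in zip(*members)]
    return prototypes


def euclidean(a: Vector, b: Vector) -> float:
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def harmful_score(
    features: LayerFeatures, prototypes: Dict[int, List[List[float]]]
) -> float:
    """Average over layers of (distance to safe) - (distance to harmful).

    Positive values mean the input sits closer to the harmful prototypes.
    """
    per_layer = [
        euclidean(vec, safe_p) - euclidean(vec, harm_p)
        for vec, safe_p, harm_p in zip(
            features, prototypes[SAFE], prototypes[HARMFUL]
        )
    ]
    return sum(per_layer) / len(per_layer)


def is_flagged(
    features: LayerFeatures,
    prototypes: Dict[int, List[List[float]]],
    threshold: float = 0.0,
) -> bool:
    """Flag the input when its score exceeds a (hypothetical) threshold."""
    return harmful_score(features, prototypes) > threshold


# Toy data: two layers, two-dimensional vectors, invented for illustration.
train_examples = [
    [[0.0, 0.1], [0.1, 0.0]],
    [[0.1, 0.0], [0.0, 0.1]],
    [[1.0, 0.9], [0.9, 1.0]],
    [[0.9, 1.0], [1.0, 0.9]],
]
train_labels = [SAFE, SAFE, HARMFUL, HARMFUL]
prototypes = fit_prototypes(train_examples, train_labels)

harmful_query = [[0.8, 0.9], [1.0, 0.7]]
safe_query = [[0.1, 0.2], [0.0, 0.1]]
```

In a real pipeline the per-layer vectors would be read from the hidden states the model already computes for the prompt, which is consistent with the abstract's claim of negligible overhead. Under the assumptions above, customizing the moderator to new categories would amount to recomputing prototypes from new labeled examples.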
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Law · Natural Language Processing Techniques
