LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models
Hayder Elesedy, Pedro M. Esperança, Silviu Vlad Oprea, Mete Ozay

TL;DR
LoRA-Guard is a parameter-efficient method that enables on-device content moderation for large language models by sharing knowledge between the LLM and the guardrail model, with minimal resource overhead.
Contribution
It introduces a novel low-rank adapter-based approach for guardrail adaptation that significantly reduces parameter overhead while maintaining accuracy.
Findings
Outperforms existing approaches with 100-1000x lower parameter overhead
Maintains accuracy on content moderation tasks
Enables on-device deployment of LLM guardrails
Abstract
Guardrails have emerged as an alternative to safety alignment for content moderation of large language models (LLMs). Existing model-based guardrails have not been designed for resource-constrained portable devices, such as mobile phones, which increasingly run LLM-based applications locally. We introduce LoRA-Guard, a parameter-efficient guardrail adaptation method that relies on knowledge sharing between LLMs and guardrail models. LoRA-Guard extracts language features from the LLMs and adapts them for the content moderation task using low-rank adapters, while a dual-path design prevents any performance degradation on the generative task. We show that LoRA-Guard outperforms existing approaches with 100-1000x lower parameter overhead while maintaining accuracy, enabling on-device content moderation.
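The dual-path idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class name `DualPathLinear`, the `guard` flag, the layer sizes, and the rank are all hypothetical, and any moderation-specific output head is omitted. The sketch assumes a single linear layer whose frozen weight `W` is shared by both paths, with a low-rank update `B @ A` applied only on the guard path, so the generative path is left bit-for-bit unchanged.

```python
import random


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class DualPathLinear:
    """Hypothetical linear layer with a frozen weight and a low-rank adapter.

    Generative path: y = W x            (adapter ignored)
    Guard path:      y = W x + B (A x)  (adapter applied)
    """

    def __init__(self, d_in, d_out, rank, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols, scale):
            return [[rng.gauss(0.0, scale) for _ in range(cols)]
                    for _ in range(rows)]

        self.W = rand(d_out, d_in, 1.0)  # frozen weight shared by both paths
        self.A = rand(rank, d_in, 0.1)   # adapter down-projection
        # LoRA commonly initializes B to zero before training; nonzero
        # values are used here only so the two paths visibly differ.
        self.B = rand(d_out, rank, 0.1)  # adapter up-projection

    def forward(self, x, guard=False):
        y = matvec(self.W, x)
        if guard:
            delta = matvec(self.B, matvec(self.A, x))
            y = [yi + di for yi, di in zip(y, delta)]
        return y

    def adapter_params(self):
        """Number of extra parameters the adapter adds to this layer."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Illustrative sizes only (assumptions, not values from the paper).
layer = DualPathLinear(d_in=64, d_out=64, rank=2)
x = [1.0] * 64

gen_out = layer.forward(x)                # generative path, adapter off
guard_out = layer.forward(x, guard=True)  # guard path, adapter on

base_params = len(layer.W) * len(layer.W[0])
print(layer.adapter_params(), base_params)  # 256 4096
```

In this toy layer the adapter adds rank × (d_in + d_out) = 256 parameters on top of 4096 frozen ones. The ratio shrinks further as the layer width grows relative to the rank, which is the general reason low-rank adapters keep parameter overhead small.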
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
