Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, Madian Khabsa

TL;DR
Llama Guard is an LLM-based safety classifier for human-AI conversations. It is fine-tuned against a safety risk taxonomy to classify both user prompts and model responses for content moderation.
Contribution
It introduces a safety risk taxonomy and Llama Guard, a Llama2-7b model instruction-tuned for prompt and response classification, which matches or exceeds existing content moderation tools on the OpenAI Moderation Evaluation dataset and ToxicChat.
Findings
Matches or exceeds existing moderation tools on benchmarks
Effective multi-class classification and binary scoring
Customizable taxonomy and output formats
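To make the findings above concrete, here is a minimal sketch of how a taxonomy-driven classifier of this kind could be wired up: a prompt is assembled from a swappable list of categories, the text verdict is parsed into a binary flag plus violated categories, and a binary score is derived from two first-token logits. The `Category` type, the category codes, the prompt wording, and all function names are illustrative assumptions; this is not the paper's actual template or the released models' API.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Category:
    code: str          # hypothetical short label such as "O1"
    name: str
    description: str


def build_prompt(categories, conversation, target_role):
    """Assemble a classification prompt from a taxonomy and a conversation.

    conversation is a list of (role, text) pairs; target_role selects
    prompt classification ("User") or response classification ("Agent").
    """
    taxonomy = "\n".join(
        f"{c.code}: {c.name}. {c.description}" for c in categories
    )
    turns = "\n".join(f"{role}: {text}" for role, text in conversation)
    return (
        f"Task: decide whether the last {target_role} message is unsafe "
        "under the categories below.\n"
        f"<categories>\n{taxonomy}\n</categories>\n"
        f"<conversation>\n{turns}\n</conversation>\n"
        "Answer 'safe' or 'unsafe'; if unsafe, list the violated "
        "category codes on a second line, comma-separated."
    )


def parse_verdict(output, categories):
    """Turn the model's text output into (is_unsafe, violated category names)."""
    lines = [ln.strip() for ln in output.strip().splitlines() if ln.strip()]
    if not lines or lines[0].lower() not in {"safe", "unsafe"}:
        raise ValueError(f"unrecognized verdict: {output!r}")
    if lines[0].lower() == "safe":
        return False, []
    by_code = {c.code: c.name for c in categories}
    codes = [s.strip() for s in lines[1].split(",")] if len(lines) > 1 else []
    # Unknown codes are dropped rather than guessed at.
    return True, [by_code[c] for c in codes if c in by_code]


def unsafe_probability(logit_safe, logit_unsafe):
    """Binary score: softmax over the two first-token logits (an assumption)."""
    m = max(logit_safe, logit_unsafe)  # subtract the max for numerical stability
    e_safe = math.exp(logit_safe - m)
    e_unsafe = math.exp(logit_unsafe - m)
    return e_unsafe / (e_safe + e_unsafe)
```

Because the taxonomy is passed in as data, swapping in a different set of categories changes the prompt and the parsed labels without touching the parsing or scoring code, which is the sense of "customizable" assumed here.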
Abstract
We introduce Llama Guard, an LLM-based input-output safeguard model geared towards Human-AI conversation use cases. Our model incorporates a safety risk taxonomy, a valuable tool for categorizing a specific set of safety risks found in LLM prompts (i.e., prompt classification). This taxonomy is also instrumental in classifying the responses generated by LLMs to these prompts, a process we refer to as response classification. For the purpose of both prompt and response classification, we have meticulously gathered a dataset of high quality. Llama Guard, a Llama2-7b model that is instruction-tuned on our collected dataset, albeit low in volume, demonstrates strong performance on existing benchmarks such as the OpenAI Moderation Evaluation dataset and ToxicChat, where its performance matches or exceeds that of currently available content moderation tools. Llama Guard functions as a…
Code & Models
- 🤗 meta-llama/Llama-Guard-3-11B-Vision · model · 2.4k dl · ♡ 70
- 🤗 meta-llama/Llama-Guard-3-8B · model · 83k dl · ♡ 283
- 🤗 meta-llama/Meta-Llama-Guard-2-8B · model · 7.1k dl · ♡ 307
- 🤗 QuantFactory/Meta-Llama-Guard-2-8B-GGUF · model · 300 dl · ♡ 12
- 🤗 RichardErkhov/meta-llama_-_Meta-Llama-Guard-2-8B-4bits · model · 5 dl
- 🤗 RichardErkhov/meta-llama_-_Meta-Llama-Guard-2-8B-8bits · model · 8 dl
- 🤗 Efficient-Large-Model/Meta-Llama-Guard-2-8B · model · 3 dl
- 🤗 tybrs/llama-guard-quant · model · 228 dl
- 🤗 LiteLLMs/Meta-Llama-Guard-2-8B-GGUF · model · 5 dl
- 🤗 meta-llama/Llama-Guard-3-8B-INT8 · model · 8.6k dl · ♡ 38
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Interpreting and Communication in Healthcare
Methods: Sparse Evolutionary Training · ALIGN
