Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations
David Nadeau, Mike Kroutikov, Karen McNeil, Simon Baribeau

TL;DR
This paper evaluates Llama2, Mistral, Gemma, and GPT on factuality, toxicity, bias, and propensity for hallucination using fourteen new datasets, and reports where each model is strong or weak on each safety aspect.
Contribution
Introduces fourteen novel safety evaluation datasets and a method for assessing large language models' safety in enterprise tasks, comparing open-source models with GPT.
Findings
GPT outperforms others in safety and factuality
Among the open-source models, Mistral hallucinates the least but struggles with toxicity
Open-source models' safety degrades in multi-turn conversations
Abstract
This paper introduces fourteen novel datasets for the evaluation of Large Language Models' safety in the context of enterprise tasks. A method was devised to evaluate a model's safety, as determined by its ability to follow instructions and output factual, unbiased, grounded, and appropriate content. In this research, we used OpenAI GPT as a point of comparison since it excels at all levels of safety. On the open-source side, for smaller models, Meta Llama2 performs well at factuality and toxicity but has the highest propensity for hallucination. Mistral hallucinates the least but cannot handle toxicity well. It performs well on a dataset mixing several tasks and safety vectors in a narrow vertical domain. Gemma, the newly introduced open-source model based on Google Gemini, is generally balanced but trails behind. When engaging in back-and-forth conversation (multi-turn prompts), we…
