Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations

David Nadeau; Mike Kroutikov; Karen McNeil; Simon Baribeau

arXiv:2404.09785·cs.CL·April 16, 2024·2 cites

Open Access · 1 Repo · 5 Datasets

TL;DR

This paper evaluates the safety, factuality, toxicity, bias, and hallucination propensity of Llama2, Mistral, Gemma, and GPT using fourteen new datasets, highlighting strengths and weaknesses in various safety aspects across models.

Contribution

Introduces fourteen novel safety evaluation datasets and a method for assessing large language models' safety in enterprise tasks, comparing open-source models with GPT.

Findings

01 GPT outperforms others in safety and factuality

02 Mistral hallucinates the least but struggles with toxicity

03 Open-source models' safety degrades in multi-turn conversations

Abstract

This paper introduces fourteen novel datasets for the evaluation of Large Language Models' safety in the context of enterprise tasks. A method was devised to evaluate a model's safety, as determined by its ability to follow instructions and output factual, unbiased, grounded, and appropriate content. In this research, we used OpenAI GPT as a point of comparison since it excels at all levels of safety. On the open-source side, for smaller models, Meta Llama2 performs well at factuality and toxicity but has the highest propensity for hallucination. Mistral hallucinates the least but cannot handle toxicity well. It performs well in a dataset mixing several tasks and safety vectors in a narrow vertical domain. Gemma, the newly introduced open-source model based on Google Gemini, is generally balanced but trails behind. When engaging in back-and-forth conversation (multi-turn prompts), we…

Code & Models

Repositories

innodatalabs/innodata-llm-safety (Official)
