The Two Sides of the Coin: Hallucination Generation and Detection with LLMs as Evaluators for LLMs

Anh Thu Maria Bui; Saskia Felizitas Brech; Natalie Hußfeldt; Tobias Jennert; Melanie Ullrich; Timo Breuer; Narjes Nikzad Khasmakhi; Philipp Schaer

arXiv:2407.09152·cs.AI·July 15, 2024

TL;DR

This paper uses four large language models (Llama 3, Gemma, GPT-3.5 Turbo, and GPT-4) to generate and detect hallucinated content as part of the CLEF ELOQUENT HalluciGen shared task, and combines their outputs by majority voting for the detection task.

Contribution

The paper compares several LLMs as evaluators for both hallucination generation and hallucination detection, and reports the strengths and weaknesses of each model.

Findings

01. Ensemble voting improves hallucination detection accuracy

02. GPT-4 shows strong performance in hallucination detection

03. Different models have complementary strengths in hallucination tasks

Abstract

Hallucination detection in Large Language Models (LLMs) is crucial for ensuring their reliability. This work presents our participation in the CLEF ELOQUENT HalluciGen shared task, where the goal is to develop evaluators for both generating and detecting hallucinated content. We explored the capabilities of four LLMs: Llama 3, Gemma, GPT-3.5 Turbo, and GPT-4, for this purpose. We also employed ensemble majority voting to incorporate all four models for the detection task. The results provide valuable insights into the strengths and weaknesses of these LLMs in handling hallucination generation and detection tasks.
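
The ensemble majority voting mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the model keys, the label strings, and the `majority_vote` helper are hypothetical, and the tie-breaking rule is an assumption, since the abstract does not say how an even split among four models is resolved.

```python
from collections import Counter


def majority_vote(votes, tie_breaker=None):
    """Pick the label that most models agree on.

    votes: dict mapping a model name to the label it predicted.
    tie_breaker: optional model name whose vote decides an even split
                 (an assumption; the abstract does not specify a rule).
    """
    if not votes:
        raise ValueError("votes must not be empty")

    counts = Counter(votes.values())
    top_count = max(counts.values())
    winners = [label for label, count in counts.items() if count == top_count]

    # A single label has the most votes: no tie to resolve.
    if len(winners) == 1:
        return winners[0]

    # Even split: defer to the designated model if it voted for a tied label.
    if tie_breaker is not None and votes.get(tie_breaker) in winners:
        return votes[tie_breaker]

    # Fallback: deterministic choice so repeated runs agree.
    return sorted(winners)[0]


# Hypothetical predictions for one example: which hypothesis is hallucinated.
clear_case = {"llama3": "hyp1", "gemma": "hyp2", "gpt35": "hyp1", "gpt4": "hyp1"}
split_case = {"llama3": "hyp1", "gemma": "hyp2", "gpt35": "hyp1", "gpt4": "hyp2"}

clear_result = majority_vote(clear_case)
split_result = majority_vote(split_case, tie_breaker="gpt4")
```

With an even number of voters, a 2-2 split is possible, so any real ensemble of four models needs some explicit rule for that case; deferring to one designated model is only one option.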

Taxonomy

Topics: Hallucinations in medical conditions · Complex Systems and Time Series Analysis

Methods: Attention Is All You Need · Cosine Annealing · Label Smoothing · Linear Layer · Adam · Dropout · Weight Decay · Multi-Head Attention