The Two Sides of the Coin: Hallucination Generation and Detection with LLMs as Evaluators for LLMs
Anh Thu Maria Bui, Saskia Felizitas Brech, Natalie Hußfeldt, Tobias Jennert, Melanie Ullrich, Timo Breuer, Narjes Nikzad Khasmakhi, Philipp Schaer

TL;DR
This paper investigates four large language models (Llama 3, Gemma, GPT-3.5 Turbo, and GPT-4) as generators and detectors of hallucinated content in the CLEF ELOQUENT HalluciGen shared task, and combines their outputs through ensemble majority voting for the detection task.
Contribution
It introduces a multi-model evaluation approach for hallucination detection and generation in LLMs, providing insights into model strengths and weaknesses.
Findings
Ensemble voting improves hallucination detection accuracy
GPT-4 shows strong performance in hallucination detection
Different models have complementary strengths in hallucination tasks
Abstract
Hallucination detection in Large Language Models (LLMs) is crucial for ensuring their reliability. This work presents our participation in the CLEF ELOQUENT HalluciGen shared task, where the goal is to develop evaluators for both generating and detecting hallucinated content. We explored the capabilities of four LLMs: Llama 3, Gemma, GPT-3.5 Turbo, and GPT-4, for this purpose. We also employed ensemble majority voting to incorporate all four models for the detection task. The results provide valuable insights into the strengths and weaknesses of these LLMs in handling hallucination generation and detection tasks.
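The ensemble majority voting mentioned in the abstract can be illustrated with a minimal sketch. The label strings, the model keys, and the tie-breaking rule below are assumptions made for illustration; the abstract does not say how a 2-2 split among the four models is resolved, so this sketch simply falls back to a caller-supplied default.

```python
from collections import Counter


def majority_vote(labels, tie_break):
    """Return the most frequent label, or `tie_break` if the top count is shared.

    `labels` holds one predicted label per model. The tie-breaking policy is
    an assumption of this sketch, not something stated in the paper.
    """
    if not labels:
        raise ValueError("at least one prediction is required")
    counts = Counter(labels)
    best_count = max(counts.values())
    winners = [label for label, count in counts.items() if count == best_count]
    # With an even number of voters (four here), a 2-2 split is possible.
    if len(winners) > 1:
        return tie_break
    return winners[0]


# Hypothetical per-model predictions for a single example.
predictions = {
    "llama3": "hallucinated",
    "gemma": "not_hallucinated",
    "gpt-3.5-turbo": "hallucinated",
    "gpt-4": "hallucinated",
}
ensemble_label = majority_vote(list(predictions.values()), tie_break="hallucinated")
```

Because four voters can split evenly, any real setup needs an explicit tie policy; alternatives to a fixed default include deferring to a single trusted model or using an odd number of voters.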
Taxonomy
Topics: Hallucinations in medical conditions · Complex Systems and Time Series Analysis
Methods: Attention Is All You Need · Cosine Annealing · Label Smoothing · Linear Layer · Adam · Dropout · Weight Decay · Multi-Head Attention
