Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis

Alessandro Scirè; Andrei Stefan Bejgu; Simone Tedeschi; Karim Ghonim; Federico Martelli; Roberto Navigli

arXiv:2411.19655·cs.CL·April 1, 2025

TL;DR

This paper introduces LLM-Oasis, which the authors describe as, to the best of their knowledge, the largest resource for training end-to-end factuality evaluators. It addresses limitations of previous resources and includes a test set that challenges state-of-the-art LLMs on factuality assessment.

Contribution

The creation of LLM-Oasis, a large, task-agnostic dataset for training and benchmarking factuality evaluators, with human-validated claims and a challenging test set.

Findings

01. GPT-4o achieves up to 60% accuracy on the dataset

02. LLM-Oasis significantly challenges current LLMs in factuality evaluation

03. The dataset enables training more robust factuality evaluators

Abstract

After the introduction of Large Language Models (LLMs), there have been substantial improvements in the performance of Natural Language Generation (NLG) tasks, including Text Summarization and Machine Translation. However, LLMs still produce outputs containing hallucinations, that is, content not grounded in factual information. Therefore, developing methods to assess the factuality of LLMs has become urgent. Indeed, resources for factuality evaluation have recently emerged. Although challenging, these resources face one or more of the following limitations: (i) they are tailored to a specific task or domain; (ii) they are limited in size, thereby preventing the training of new factuality evaluators; (iii) they are designed for simpler verification tasks, such as claim verification. To address these issues, we introduce LLM-Oasis, to the best of our knowledge the largest resource…

Code & Models

Repositories

Babelscape/LLM-Oasis (Official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning

Methods: Sparse Evolutionary Training