Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis
Alessandro Scirè, Andrei Stefan Bejgu, Simone Tedeschi, Karim Ghonim, Federico Martelli, Roberto Navigli

TL;DR
This paper introduces LLM-Oasis, to the best of the authors' knowledge the largest resource for training end-to-end factuality evaluators. It addresses limitations of previous resources and proves challenging for state-of-the-art LLMs in factuality assessment.
Contribution
The creation of LLM-Oasis, a large, task-agnostic dataset for training and benchmarking factuality evaluators, with human-validated claims and a challenging test set.
Findings
GPT-4o achieves up to 60% accuracy on the dataset
LLM-Oasis significantly challenges current LLMs in factuality evaluation
The dataset enables training more robust factuality evaluators
Abstract
After the introduction of Large Language Models (LLMs), there have been substantial improvements in the performance of Natural Language Generation (NLG) tasks, including Text Summarization and Machine Translation. However, LLMs still produce outputs containing hallucinations, that is, content not grounded in factual information. Therefore, developing methods to assess the factuality of LLMs has become urgent. Indeed, resources for factuality evaluation have recently emerged. Although challenging, these resources face one or more of the following limitations: (i) they are tailored to a specific task or domain; (ii) they are limited in size, thereby preventing the training of new factuality evaluators; (iii) they are designed for simpler verification tasks, such as claim verification. To address these issues, we introduce LLM-Oasis, to the best of our knowledge the largest resource…
Taxonomy
Topics: Adversarial Robustness in Machine Learning
Methods: Sparse Evolutionary Training
