Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs

Xue-Yong Fu; Md Tahmid Rahman Laskar; Cheng Chen; Shashi Bhushan TN

arXiv:2311.00681·cs.CL·November 2, 2023·1 cites

TL;DR

This paper investigates whether large language models can reliably evaluate the factual accuracy of generated summaries, revealing significant limitations and weak correlations with human judgments, especially for GPT-4 and PaLM-2.

Contribution

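The paper introduces a question-answering-based approach to factuality assessment in which a single LLM carries out the entire scoring process. It also benchmarks several LLMs on direct factuality scoring against traditional measures and human annotations.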

Findings

01 GPT-3.5 shows some correlation with human judgments

02 GPT-4 and PaLM-2 lack significant correlation with human evaluations

03 Current LLMs have fundamental limitations in factuality assessment
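
Statements about correlation with human judgments, like the findings above, are usually backed by a correlation coefficient computed between metric scores and human ratings. The sketch below assumes Spearman's rank correlation; this page does not say which coefficient the paper uses, and the score lists are hypothetical toy values, not results from the paper.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical toy data: one LLM factuality score and one human rating per summary.
llm_scores = [0.9, 0.4, 0.7, 0.2, 0.8]
human_ratings = [4, 2, 3, 1, 5]
rho = spearman(llm_scores, human_ratings)
print(f"Spearman rho = {rho:.2f}")
```

A value near 1 would mean the LLM ranks summaries much as the annotators do. A value near 0 corresponds to the lack of correlation reported for GPT-4 and PaLM-2. Judging significance additionally requires a p-value, which this sketch does not compute.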

Abstract

In recent years, Large Language Models (LLMs) have gained immense attention due to their notable emergent capabilities, surpassing those seen in earlier language models. A particularly intriguing application of LLMs is their role as evaluators for texts produced by various generative models. In this study, we delve into the potential of LLMs as reliable assessors of factual consistency in summaries generated by text-generation models. Initially, we introduce an innovative approach for factuality assessment using LLMs. This entails employing a singular LLM for the entirety of the question-answering-based factuality scoring process. Following this, we examine the efficacy of various LLMs in direct factuality scoring, benchmarking them against traditional measures and human annotations. Contrary to initial expectations, our results indicate a lack of significant correlations between…
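
As a rough illustration of question-answering-based factuality scoring with a single LLM, the sketch below lets one model generate questions from the summary, answer them against both the summary and the source, and judge whether the answers agree. The abstract does not spell out the steps, so the three-stage pipeline, the prompts, the `LLM` callable interface, and every function name here are assumptions; `toy_llm` is a hard-coded stand-in for a real model so that the example runs offline.

```python
from typing import Callable, List

# Hypothetical interface: a function that takes a prompt and returns text.
LLM = Callable[[str], str]

def generate_questions(llm: LLM, summary: str, n: int = 3) -> List[str]:
    """Ask the model for up to n factual questions about the summary."""
    prompt = (f"Write {n} short factual questions answerable from this text, "
              f"one per line.\n\nText: {summary}")
    lines = [line.strip() for line in llm(prompt).splitlines()]
    return [line for line in lines if line][:n]

def answer(llm: LLM, question: str, context: str) -> str:
    """Ask the same model to answer a question from the given context only."""
    prompt = ("Answer the question using only the text below. "
              "Reply 'unanswerable' if the text does not say.\n\n"
              f"Text: {context}\nQuestion: {question}")
    return llm(prompt).strip()

def answers_match(llm: LLM, question: str, a: str, b: str) -> bool:
    """Ask the same model whether two answers agree."""
    prompt = (f"Question: {question}\nAnswer A: {a}\nAnswer B: {b}\n"
              "Do the two answers agree? Reply yes or no.")
    return llm(prompt).strip().lower().startswith("yes")

def qa_factuality_score(llm: LLM, source: str, summary: str, n: int = 3) -> float:
    """Fraction of summary-derived questions answered consistently by the source."""
    questions = generate_questions(llm, summary, n)
    if not questions:
        return 0.0
    matches = sum(
        answers_match(llm, q, answer(llm, q, summary), answer(llm, q, source))
        for q in questions
    )
    return matches / len(questions)

def toy_llm(prompt: str) -> str:
    """Hard-coded stand-in for a real model (assumption, for demonstration only)."""
    if prompt.startswith("Write"):
        return "Where was the meeting held?\nWho chaired the meeting?"
    if prompt.startswith("Answer"):
        text = prompt.split("Text: ", 1)[1].rsplit("\nQuestion: ", 1)[0]
        question = prompt.rsplit("\nQuestion: ", 1)[1]
        candidates = ("Paris", "Berlin") if question.startswith("Where") else ("Alice", "Bob")
        for candidate in candidates:
            if candidate in text:
                return candidate
        return "unanswerable"
    # Comparison prompt: parse the "Answer A" and "Answer B" fields.
    fields = dict(line.split(": ", 1) for line in prompt.splitlines() if ": " in line)
    return "yes" if fields["Answer A"].lower() == fields["Answer B"].lower() else "no"

source_text = "Alice chaired the meeting in Paris."
summary_text = "Alice chaired the meeting in Berlin."
score = qa_factuality_score(toy_llm, source_text, summary_text)
print(score)
```

In this toy run the summary gets the chair right and the location wrong, so one of the two questions is answered consistently. With a real model, `toy_llm` would be replaced by a wrapper around an actual API call.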

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods

Methods: Attention Is All You Need · Cosine Annealing · Label Smoothing · Linear Layer · Softmax · Linear Warmup With Cosine Annealing · Dropout