Are Large Language Models Good at Utility Judgments?

Hengran Zhang; Ruqing Zhang; Jiafeng Guo; Maarten de Rijke; Yixing Fan; Xueqi Cheng

arXiv:2403.19216 · cs.IR · June 11, 2024 · 1 citation

TL;DR

This paper evaluates the ability of large language models to assess the utility of passages for open-domain question answering, introducing a benchmark and analyzing factors influencing utility judgments.

Contribution

It presents a comprehensive benchmark and analysis of LLMs' utility evaluation capabilities, including a novel approach to reduce input sequence dependency.

Findings

01. LLMs can distinguish relevance from utility with proper instruction

02. LLMs are highly receptive to counterfactual passages

03. A k-sampling, listwise method improves answer generation
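The k-sampling, listwise idea in the third finding can be illustrated with a minimal sketch. This page does not spell out the procedure, so everything below is an assumption made for illustration: the `judge` callable (a stand-in for an LLM prompt that returns the positions it considers useful), the function name `k_sampling_listwise`, and the strict-majority voting rule. The sketch shuffles the candidate list k times, asks the judge for a listwise utility judgment on each ordering, and keeps the passages selected in most runs, which is one way to reduce dependency on input order.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Set

# Hypothetical judge signature: given a question and an ordered passage list,
# return the 0-based positions in that list that are judged useful.
Judge = Callable[[str, Sequence[str]], Set[int]]


def k_sampling_listwise(
    question: str,
    passages: Sequence[str],
    judge: Judge,
    k: int = 5,
    seed: int = 0,
) -> List[int]:
    """Return original indices of passages judged useful in a majority of k shuffled runs."""
    rng = random.Random(seed)
    votes: Counter = Counter()
    for _ in range(k):
        # Sample a fresh ordering of the candidate passages.
        order = list(range(len(passages)))
        rng.shuffle(order)
        shuffled = [passages[i] for i in order]
        for pos in judge(question, shuffled):
            # Map the position in the shuffled list back to the original index.
            votes[order[pos]] += 1
    # Assumed aggregation rule: keep passages selected in more than half of the runs.
    return sorted(i for i, v in votes.items() if 2 * v > k)


def toy_judge(question: str, passages: Sequence[str]) -> Set[int]:
    """Stand-in for an LLM call: keyword match plus an artificial bias toward position 0."""
    picked = {i for i, p in enumerate(passages) if "Paris" in p}
    if passages:
        picked.add(0)  # simulated position bias
    return picked


if __name__ == "__main__":
    docs = [
        "Berlin is in Germany.",
        "Paris is the capital of France.",
        "Lyon is a French city.",
        "The Eiffel Tower is in Paris.",
    ]
    print(k_sampling_listwise("What is the capital of France?", docs, toy_judge))
```

In this toy setup the two passages mentioning Paris are selected in every run, while a passage picked only because it happened to land first survives only if that happens in more than half of the k orderings. A real judge would replace `toy_judge` with a model call.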

Abstract

Retrieval-augmented generation (RAG) is considered to be a promising approach to alleviate the hallucination issue of large language models (LLMs), and it has received widespread attention from researchers recently. Due to the limitation in the semantic understanding of retrieval models, the success of RAG depends heavily on the ability of LLMs to identify passages with utility. Recent efforts have explored the ability of LLMs to assess the relevance of passages in retrieval, but there has been limited work on evaluating the utility of passages in supporting question answering. In this work, we conduct a comprehensive study of the capabilities of LLMs in utility evaluation for open-domain QA. Specifically, we introduce a benchmarking procedure and a collection of candidate passages with different characteristics, facilitating a series of experiments with five representative LLMs. Our…

Code & Models

Repositories

ict-bigdatalab/utility_judgments (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · WordPiece · Dropout · Layer Normalization · Byte Pair Encoding · Softmax · Multi-Head Attention · Dense Connections