Are Large Language Models Good at Utility Judgments?
Hengran Zhang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, Xueqi Cheng

TL;DR
This paper evaluates the ability of large language models to assess the utility of passages for open-domain question answering, introducing a benchmark and analyzing factors influencing utility judgments.
Contribution
It presents a comprehensive benchmark and analysis of LLMs' utility evaluation capabilities, including a novel approach to reduce input sequence dependency.
Findings
LLMs can distinguish relevance from utility with proper instruction
LLMs are highly receptive to counterfactual passages
A k-sampling, listwise method improves answer generation
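The k-sampling, listwise idea in the last finding can be illustrated with a minimal sketch. It assumes the method shuffles the passage list k times, asks for a listwise utility judgment on each ordering, and aggregates by vote; this summary does not spell out those details, so the majority threshold, the `judge` callable, and every name below are hypothetical stand-ins rather than the authors' implementation.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence


def k_sampling_utility(
    question: str,
    passages: Sequence[str],
    judge: Callable[[str, List[str]], List[int]],
    k: int = 5,
    seed: int = 0,
) -> List[str]:
    """Keep passages judged useful in a majority of k shuffled listwise calls.

    `judge` is a hypothetical stand-in for an LLM call: given the question and
    an ordered passage list, it returns the positions it considers useful.
    """
    rng = random.Random(seed)
    votes: Counter = Counter()
    for _ in range(k):
        # Present the same passages in a fresh random order each round.
        order = list(range(len(passages)))
        rng.shuffle(order)
        shuffled = [passages[i] for i in order]
        # Map each selected position back to the original passage index.
        for pos in judge(question, shuffled):
            votes[order[pos]] += 1
    # Assumed aggregation rule: strict majority over the k rounds.
    return [p for i, p in enumerate(passages) if votes[i] > k / 2]


# Toy judge for demonstration only: marks passages that mention "Paris".
def toy_judge(question: str, ordered: List[str]) -> List[int]:
    return [i for i, p in enumerate(ordered) if "Paris" in p]


passages = [
    "Paris is the capital of France.",
    "France borders Spain.",
    "The capital city, Paris, lies on the Seine.",
]
selected = k_sampling_utility("What is the capital of France?", passages, toy_judge)
```

Because each passage's votes are counted across several orderings, a judge that simply favors whatever appears first has less influence on the final selection than it would in a single listwise call.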
Abstract
Retrieval-augmented generation (RAG) is considered a promising approach to alleviating the hallucination issue of large language models (LLMs), and it has recently received widespread attention from researchers. Because retrieval models are limited in their semantic understanding, the success of RAG relies heavily on the ability of LLMs to identify passages with utility. Recent efforts have explored the ability of LLMs to assess the relevance of passages in retrieval, but there has been limited work on evaluating the utility of passages in supporting question answering. In this work, we conduct a comprehensive study of the capabilities of LLMs in utility evaluation for open-domain QA. Specifically, we introduce a benchmarking procedure and a collection of candidate passages with different characteristics, facilitating a series of experiments with five representative LLMs. Our…
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · WordPiece · Dropout · Layer Normalization · Byte Pair Encoding · Softmax · Multi-Head Attention · Dense Connections
