Ranking Generated Answers: On the Agreement of Retrieval Models with Humans on Consumer Health Questions

Sebastian Heineking; Jonas Probst; Daniel Steinbach; Martin Potthast; Harrisen Scells

arXiv:2408.09831 · cs.IR · January 20, 2025

TL;DR

This paper proposes a ranking-based method for evaluating large language model answers to health questions. The method correlates with expert judgments (Kendall's τ = 0.64) and addresses the challenge of scalable assessment of open-ended answers in sensitive domains.

Contribution

It introduces an approach that uses ranking models to evaluate LLM answers in the health domain, reducing reliance on costly expert annotations while correlating with human preferences.

Findings

01 The ranking method correlates with human expert preferences (Kendall's τ = 0.64).

02 Answer quality improves with larger models and advanced prompting.

03 The approach is effective on the CLEF 2021 eHealth dataset.
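
To make the agreement figure above concrete, here is a minimal sketch of how Kendall's τ compares two rankings of the same items. It is an illustration only, not the paper's evaluation code: the function name `kendall_tau`, the answer IDs, and both example rankings are hypothetical, and the sketch assumes the tie-free τ-a variant, which may differ from the variant used in the paper.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items (no ties).

    Each ranking is a list of item IDs ordered from best to worst.
    """
    if len(rank_a) < 2:
        raise ValueError("need at least two items to compare rankings")
    if len(set(rank_a)) != len(rank_a) or len(set(rank_b)) != len(rank_b):
        raise ValueError("rankings must not contain duplicate items")
    if set(rank_a) != set(rank_b):
        raise ValueError("both rankings must contain the same items")

    # Position of each item in each ranking (0 = ranked first).
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}

    concordant = 0
    discordant = 0
    for x, y in combinations(rank_a, 2):
        # A pair is concordant if both rankings order x and y the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1

    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical example: four answers ranked by a model and by a human.
model_ranking = ["a1", "a2", "a3", "a4"]
human_ranking = ["a1", "a3", "a2", "a4"]
tau = kendall_tau(model_ranking, human_ranking)
print(f"Kendall's tau = {tau:.2f}")
```

In this made-up example the two rankings disagree on one of six answer pairs, so τ = (5 − 1) / 6 ≈ 0.67. A value of 1 means identical orderings and −1 means fully reversed orderings.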

Abstract

Evaluating the output of generative large language models (LLMs) is challenging and difficult to scale. Many evaluations of LLMs focus on tasks such as single-choice question-answering or text classification. These tasks are not suitable for assessing open-ended question-answering capabilities, which are critical in domains where expertise is required. One such domain is health, where misleading or incorrect answers can have a negative impact on a user's well-being. Using human experts to evaluate the quality of LLM answers is generally considered the gold standard, but expert annotation is costly and slow. We present a method for evaluating LLM answers that uses ranking models trained on annotated document collections as a substitute for explicit relevance judgements and apply it to the CLEF 2021 eHealth dataset. In a user study, our method correlates with the preferences of a human…

Code & Models

Repositories

webis-de/arxiv24-ranking-generated-answers (Official)

Taxonomy

Topics: Expert finding and Q&A systems · Advanced Text Analysis Techniques · Spam and Phishing Detection