Ranking Generated Answers: On the Agreement of Retrieval Models with Humans on Consumer Health Questions
Sebastian Heineking, Jonas Probst, Daniel Steinbach, Martin Potthast, Harrisen Scells

TL;DR
This paper proposes a ranking-based method for evaluating large language model answers to health questions. Its rankings correlate well with expert judgments, which addresses the challenge of assessing open-ended answers at scale in sensitive domains.
Contribution
It introduces an approach that uses ranking models to evaluate LLM answers in the health domain, reducing reliance on costly expert annotations and showing a strong correlation with human preferences.
Findings
The ranking method correlates with human expert preferences (Kendall's τ=0.64).
Answer quality improves with larger models and advanced prompting.
The approach is effective on the CLEF 2021 eHealth dataset.
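To make the reported agreement statistic concrete, the sketch below computes Kendall's τ between two orderings of the same set of answers. It is only an illustration of the statistic, not the authors' evaluation code: the function name `kendall_tau` and both score lists are hypothetical, and the simple tau-a variant (no tie correction) is an assumption, since the summary does not say which variant the paper uses.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two equally long score lists (no tie correction)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two items")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # A pair is concordant when both lists order items i and j the same way.
        product = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # Tied pairs (product == 0) count toward neither total.

    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical data: five answers to one question, scored by a ranking model
# and by human annotators (higher means better in both lists).
model_scores = [0.91, 0.75, 0.62, 0.40, 0.33]
human_scores = [5, 3, 4, 2, 1]

tau = kendall_tau(model_scores, human_scores)
print(f"Kendall's tau = {tau:.2f}")
```

A value of 1 means the two orderings agree on every pair of answers, 0 means no association, and -1 means they are exactly reversed. In this toy data the two lists disagree on one of the ten pairs.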
Abstract
Evaluating the output of generative large language models (LLMs) is challenging and difficult to scale. Many evaluations of LLMs focus on tasks such as single-choice question-answering or text classification. These tasks are not suitable for assessing open-ended question-answering capabilities, which are critical in domains where expertise is required. One such domain is health, where misleading or incorrect answers can have a negative impact on a user's well-being. Using human experts to evaluate the quality of LLM answers is generally considered the gold standard, but expert annotation is costly and slow. We present a method for evaluating LLM answers that uses ranking models trained on annotated document collections as a substitute for explicit relevance judgements and apply it to the CLEF 2021 eHealth dataset. In a user study, our method correlates with the preferences of a human…
Taxonomy
Topics: Expert finding and Q&A systems · Advanced Text Analysis Techniques · Spam and Phishing Detection
