Revisiting Human-vs-LLM judgments using the TREC Podcast Track
Watheq Mansour, J. Shane Culpepper, Joel Mackenzie, and Andrew Yates

TL;DR
This study evaluates how well large language models (LLMs) agree with human judgments when assessing the relevance of podcast segments from the TREC 2020 and 2021 podcast track. It reports that LLM judgments often align more closely with human experts than with the original TREC assessors, in an audio-based retrieval setting.
Contribution
It provides a novel analysis of LLM-human agreement in podcast relevance judgments, highlighting the impact of assessor disagreement on system evaluation in audio retrieval scenarios.
Findings
LLMs often agree more with human experts than with TREC assessors.
Disagreement between assessors affects system ranking reliability.
Audio-based relevance judgments show different agreement patterns than text-based ones.
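The effect of assessor disagreement on system rankings can be illustrated with a rank correlation between two orderings of the same systems. The sketch below computes Kendall's tau (the tau-a variant) in plain Python. The choice of Kendall's tau, the system names, and the scores are all assumptions made for illustration; none of them come from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {system: score} mappings over the same systems."""
    if set(scores_a) != set(scores_b):
        raise ValueError("both mappings must cover the same systems")
    systems = sorted(scores_a)
    if len(systems) < 2:
        raise ValueError("need at least two systems to compare rankings")

    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # A pair is concordant when both judgment sets order s and t the same way.
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # Ties in either mapping count as neither concordant nor discordant.

    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical effectiveness scores for four systems under two sets of judgments.
human_scores = {"sysA": 0.40, "sysB": 0.35, "sysC": 0.30, "sysD": 0.20}
llm_scores = {"sysA": 0.38, "sysB": 0.41, "sysC": 0.29, "sysD": 0.22}

tau = kendall_tau(human_scores, llm_scores)
print(f"Kendall's tau: {tau:.3f}")
```

In this toy example only the pair (sysA, sysB) swaps order, so five of the six pairs are concordant. A tau close to 1 would suggest that replacing one set of judgments with the other leaves the system ranking largely unchanged.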
Abstract
Using large language models (LLMs) to annotate relevance is an increasingly important technique in the information retrieval community. While some studies demonstrate that LLMs can achieve high user agreement with ground truth (human) judgments, other studies have argued for the opposite conclusion. To the best of our knowledge, these studies have primarily focused on classic ad-hoc text search scenarios. In this paper, we conduct an analysis of user agreement between LLMs and human experts, and explore the impact that disagreement has on system rankings. In contrast to prior studies, we focus on a collection composed of audio files that are transcribed into two-minute segments -- the TREC 2020 and 2021 podcast track. We employ five different LLMs to re-assess all of the query-segment pairs, which were originally annotated by TREC assessors. Furthermore, we re-assess a small subset of…
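Agreement between two sets of relevance labels is commonly summarized with a chance-corrected statistic. As a minimal sketch, the code below computes Cohen's kappa for two annotators over the same query-segment pairs. The use of Cohen's kappa, the grade scale, and the label values are assumptions for illustration; the abstract does not state which agreement measure or data the paper uses.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items that received identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement if each annotator labelled independently
    # according to their own marginal label distribution.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical graded labels for eight query-segment pairs.
human_labels = [0, 0, 1, 1, 2, 0, 1, 0]
llm_labels = [0, 1, 1, 1, 2, 0, 0, 0]

kappa = cohen_kappa(human_labels, llm_labels)
print(f"Cohen's kappa: {kappa:.3f}")
```

Here the two annotators agree on six of eight items (0.75 observed agreement), while chance agreement is 26/64, giving a kappa of 11/19, roughly 0.58. Raw agreement alone would overstate how well the two label sets match.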
Taxonomy
Topics: Information Retrieval and Search Behavior · Topic Modeling · Expert finding and Q&A systems
