Revisiting Human-vs-LLM judgments using the TREC Podcast Track

Watheq Mansour, J. Shane Culpepper, Joel Mackenzie, and Andrew Yates

arXiv:2601.05603 · cs.IR · January 15, 2026

Open Access

TL;DR

This study evaluates how well large language models (LLMs) agree with human experts when assessing the relevance of podcast segments. It reports that LLMs often align more closely with human experts than with the original TREC assessors, especially in audio-based retrieval tasks.

Contribution

It provides a novel analysis of LLM-human agreement in podcast relevance judgments, highlighting the impact of assessor disagreement on system evaluation in audio retrieval scenarios.

Findings

01. LLMs often agree more with human experts than with TREC assessors.

02. Disagreement between assessors affects system ranking reliability.

03. Audio-based relevance judgments show different agreement patterns than text-based ones.

Abstract

Using large language models (LLMs) to annotate relevance is an increasingly important technique in the information retrieval community. While some studies demonstrate that LLMs can achieve high user agreement with ground truth (human) judgments, other studies have argued for the opposite conclusion. To the best of our knowledge, these studies have primarily focused on classic ad-hoc text search scenarios. In this paper, we conduct an analysis of user agreement between LLMs and human experts, and explore the impact that disagreement has on system rankings. In contrast to prior studies, we focus on a collection composed of audio files that are transcribed into two-minute segments -- the TREC 2020 and 2021 podcast track. We employ five different LLMs to re-assess all of the query-segment pairs, which were originally annotated by TREC assessors. Furthermore, we re-assess a small subset of…
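As a rough illustration of the two quantities the abstract refers to (agreement between two sets of relevance labels, and the effect of swapping judgments on system rankings), the sketch below computes Cohen's kappa and Kendall's tau-a in plain Python. The abstract as shown here does not name the measures the authors use, so both are assumptions chosen because they are common in this kind of study. The function names and all label and score values are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length lists of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def kendall_tau_a(scores_x, scores_y):
    """Kendall's tau-a between two score lists over the same systems."""
    if len(scores_x) != len(scores_y) or len(scores_x) < 2:
        raise ValueError("need two equal-length lists with at least 2 systems")
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_x)), 2):
        sign = (scores_x[i] - scores_x[j]) * (scores_y[i] - scores_y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(scores_x)
    return (concordant - discordant) / (n * (n - 1) // 2)


if __name__ == "__main__":
    # Hypothetical graded relevance labels for eight query-segment pairs.
    human_labels = [0, 1, 1, 0, 2, 2, 1, 0]
    llm_labels = [0, 1, 0, 0, 2, 1, 1, 0]
    print("kappa:", round(cohen_kappa(human_labels, llm_labels), 4))

    # Hypothetical effectiveness scores for four systems, evaluated once
    # with human judgments and once with LLM judgments.
    scores_human = [0.30, 0.25, 0.20, 0.10]
    scores_llm = [0.28, 0.29, 0.15, 0.12]
    print("tau-a:", round(kendall_tau_a(scores_human, scores_llm), 4))
```

Kappa corrects raw agreement for chance, which matters when most segments are non-relevant. Tau-a counts how many system pairs keep their relative order when one set of judgments replaces the other, so a high kappa with a low tau would suggest that the disagreements fall on segments that separate systems.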

Taxonomy

Topics: Information Retrieval and Search Behavior · Topic Modeling · Expert finding and Q&A systems