Potential and Perils of Large Language Models as Judges of Unstructured Textual Data

Rewina Bedemariam; Natalie Perez; Sreyoshi Bhaduri; Satya Kapoor; Alex Gil; Elizabeth Conjar; Ikkei Itoku; David Theil; Aman Chadha; Naumaan Nayyar

arXiv:2501.08167 · cs.CL · January 22, 2025 · 3 cites

TL;DR

This paper evaluates the effectiveness of large language models as automated judges for assessing the thematic accuracy of summaries generated from unstructured survey data, comparing their performance to human evaluators.

Contribution

It introduces a scalable LLM-as-judge framework for evaluating thematic summaries, validating its effectiveness against traditional human assessments.

Findings

01. LLM-as-judge approaches are comparable to human raters in evaluating summaries.

02. Humans outperform LLMs in detecting subtle contextual nuances.

03. The study highlights the potential and limitations of using LLMs for automated evaluation.
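Saying that an LLM judge is "comparable to human raters" is, in practice, a statement about inter-rater agreement. The sketch below shows one common way to quantify that, Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The choice of metric, the function name `cohen_kappa`, and the binary labels are assumptions made for illustration; this summary does not say which statistic or data the authors used.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(labels_a)

    # Observed agreement: fraction of items where the two raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement: chance that both raters pick the same category
    # if each labeled independently according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical category; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical data: 1 = "summary theme matches the responses", 0 = "does not match".
human_ratings = [1, 1, 0, 1, 0, 0, 1, 1]
llm_ratings = [1, 1, 0, 0, 0, 1, 1, 1]

kappa = cohen_kappa(human_ratings, llm_ratings)
print(f"Cohen's kappa (human vs. LLM judge): {kappa:.3f}")
```

Here the raters agree on 6 of 8 items (0.75 raw agreement), but kappa is lower because some of that agreement would occur by chance. Values near 1 indicate strong agreement; values near 0 indicate agreement no better than chance.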

Abstract

Rapid advancements in large language models have unlocked remarkable capabilities when it comes to processing and summarizing unstructured text data. This has implications for the analysis of rich, open-ended datasets, such as survey responses, where LLMs hold the promise of efficiently distilling key themes and sentiments. However, as organizations increasingly turn to these powerful AI systems to make sense of textual feedback, a critical question arises: can we trust LLMs to accurately represent the perspectives contained within these text-based datasets? While LLMs excel at generating human-like summaries, there is a risk that their outputs may inadvertently diverge from the true substance of the original responses. Discrepancies between the LLM-generated outputs and the actual themes present in the data could lead to flawed decision-making, with far-reaching consequences for…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: LLaMA