Potential and Perils of Large Language Models as Judges of Unstructured Textual Data
Rewina Bedemariam, Natalie Perez, Sreyoshi Bhaduri, Satya Kapoor, Alex Gil, Elizabeth Conjar, Ikkei Itoku, David Theil, Aman Chadha, Naumaan Nayyar

TL;DR
This paper evaluates the effectiveness of large language models as automated judges for assessing the thematic accuracy of summaries generated from unstructured survey data, comparing their performance to human evaluators.
Contribution
It introduces a scalable LLM-as-judge framework for evaluating thematic summaries, validating its effectiveness against traditional human assessments.
Findings
LLM-as-judge approaches are comparable to human raters in evaluating summaries.
Humans outperform LLMs in detecting subtle contextual nuances.
The study highlights the potential and limitations of using LLMs for automated evaluation.
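As a rough illustration of what "comparable to human raters" can mean in practice, agreement between an LLM judge and a human rater is often summarized with a chance-corrected statistic. The sketch below computes Cohen's kappa in plain Python. The choice of Cohen's kappa, the function name cohen_kappa, and the ratings are assumptions made for illustration; this summary does not state which agreement metrics or data the paper uses.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement by chance, from each rater's marginal label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings (1 = summary theme judged accurate, 0 = not accurate).
human = [1, 1, 0, 1, 0, 0, 1, 1]
llm_judge = [1, 1, 0, 0, 0, 1, 1, 1]

kappa = cohen_kappa(human, llm_judge)
print(round(kappa, 3))  # prints 0.467
```

Here the raters agree on 6 of 8 items (0.75), but because chance alone would give about 0.53 agreement, kappa comes out lower, near 0.47. A high raw agreement rate therefore does not by itself show that an LLM judge tracks human judgment.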
Abstract
Rapid advancements in large language models have unlocked remarkable capabilities when it comes to processing and summarizing unstructured text data. This has implications for the analysis of rich, open-ended datasets, such as survey responses, where LLMs hold the promise of efficiently distilling key themes and sentiments. However, as organizations increasingly turn to these powerful AI systems to make sense of textual feedback, a critical question arises: can we trust LLMs to accurately represent the perspectives contained within these text-based datasets? While LLMs excel at generating human-like summaries, there is a risk that their outputs may inadvertently diverge from the true substance of the original responses. Discrepancies between the LLM-generated outputs and the actual themes present in the data could lead to flawed decision-making, with far-reaching consequences for…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: LLaMA
