Large Language Models Struggle to Describe the Haystack without Human Help: Human-in-the-loop Evaluation of Topic Models

Zongxia Li; Lorena Calvo-Bartolomé; Alexander Hoyle; Paiheng Xu; Alden Dima; Juan Francisco Fung; Jordan Boyd-Graber

arXiv:2502.14748·cs.CL·June 5, 2025

TL;DR

This paper evaluates the effectiveness of large language models in understanding large document collections, highlighting their strengths in generating readable topics but also their limitations without human supervision, especially for domain-specific data.

Contribution

It provides a comparative analysis of LLM-based and traditional topic models, emphasizing the need for human oversight to improve LLM performance in real-world data exploration.

Findings

01

LLMs produce more human-readable topics than traditional models.

02

Adding human supervision improves LLM data exploration.

03

LLMs struggle with domain-specific data and hallucination issues.

Abstract

A common use of NLP is to facilitate the understanding of large document collections, with a shift from traditional topic models to Large Language Models (LLMs). Yet the effectiveness of using LLMs for large-corpus understanding in real-world applications remains under-explored. This study measures the knowledge users acquire with unsupervised LLM-based exploratory approaches, supervised LLM-based exploratory approaches, or traditional topic models on two datasets. While LLM-based methods generate more human-readable topics and show higher average win probabilities than traditional models for data exploration, they produce overly generic topics for domain-specific datasets, from which users cannot easily learn much about the documents. Adding human supervision to the LLM generation process improves data exploration by mitigating hallucination and over-genericity but requires greater human effort. In contrast,…

Taxonomy

Topics: Artificial Intelligence in Law