Using language models to label clusters of scientific documents

Dakota Murray; Chaoqun Ni; Weiye Gu; Trevor Hubbard

arXiv:2511.02601·cs.DL·November 11, 2025

Open Access

TL;DR

This paper explores how generative language models like ChatGPT can automatically produce human-readable labels for scientific document clusters, improving interpretability in bibliometric workflows.

Contribution

It defines and formalizes the task of descriptive label generation, proposes a structured workflow, and develops an evaluative framework for assessing label quality.

Findings

01. Language models generate labels comparable to characteristic labels.

02. The proposed framework effectively evaluates descriptive labels.

03. Design considerations influence label quality and applicability.

Abstract

Automated label generation for clusters of scientific documents is a common task in bibliometric workflows. Traditionally, labels were formed by concatenating distinguishing characteristics of a cluster's documents; while straightforward, this approach often produces labels that are terse and difficult to interpret. The advent and widespread accessibility of generative language models, such as ChatGPT, make it possible to automatically generate descriptive and human-readable labels that closely resemble those assigned by human annotators. Language-model label generation has already seen widespread use in bibliographic databases and analytical workflows. However, its rapid adoption has outpaced the theoretical, practical, and empirical foundations. In this study, we address the automated label generation task and make four key contributions: (1) we define two distinct types of labels:…
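To make the traditional approach concrete, here is a minimal sketch of a characteristic label built by concatenating a cluster's most distinguishing terms. The function name `characteristic_label`, the whitespace tokenizer, and the scoring rule (share of the cluster's documents containing a term, weighted by the log of how few clusters contain it) are all assumptions made for illustration; the abstract does not specify how the authors compute such labels.

```python
import math
from collections import Counter


def characteristic_label(clusters, cluster_id, top_k=3):
    """Join the top_k terms that best distinguish one cluster from the others.

    clusters: dict mapping a cluster id to a list of document strings.
    The scoring rule is an illustrative assumption, not the paper's method.
    """
    # Per-cluster document frequency: in how many documents each term appears.
    doc_freq = {
        cid: Counter(term for doc in docs for term in set(doc.lower().split()))
        for cid, docs in clusters.items()
    }
    n_clusters = len(clusters)
    n_docs = len(clusters[cluster_id])

    scores = {}
    for term, count in doc_freq[cluster_id].items():
        # Number of clusters in which the term occurs at least once.
        containing = sum(1 for freq in doc_freq.values() if term in freq)
        # Terms common inside the cluster and rare across clusters score high.
        scores[term] = (count / n_docs) * math.log(n_clusters / containing)

    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return "; ".join(ranked[:top_k])


if __name__ == "__main__":
    example = {
        "a": ["graph neural networks", "graph embedding methods"],
        "b": ["citation networks analysis", "citation counts methods"],
    }
    print(characteristic_label(example, "a"))
```

A label produced this way is a bare list of terms, which illustrates why the abstract calls such labels terse and hard to interpret compared with a descriptive phrase written by a language model.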

Taxonomy

Topics: Machine Learning in Materials Science · Artificial Intelligence in Healthcare and Education · Topic Modeling