Metadata Enrichment of Long Text Documents using Large Language Models

Manika Lamba; You Peng; Sophie Nikolov; Glen Layne-Worthey; J. Stephen Downie

arXiv:2506.20918·cs.DL·June 27, 2025

Open Access

TL;DR

This paper demonstrates how large language models can semantically enrich metadata of long text documents like theses and dissertations, improving searchability and access in digital repositories.

Contribution

It introduces a novel method combining manual efforts and LLMs to enhance metadata quality for long texts in digital libraries.

Findings

01

Metadata enrichment improves search and access in digital repositories.

02

LLMs effectively add valuable metadata access points.

03

The approach benefits repositories with missing or incomplete metadata.

Abstract

In this project, we semantically enriched and enhanced the metadata of long text documents, specifically English-language theses and dissertations published from 1920 to 2020 and retrieved from the HathiTrust Digital Library, through a combination of manual efforts and large language models (LLMs). The resulting dataset provides a valuable resource for advancing research in areas such as computational social science, digital humanities, and information science. Our paper shows that enriching metadata using LLMs benefits digital repositories by introducing additional metadata access points that may not have been foreseen originally, accommodating various content types. This approach is particularly effective for repositories with significant missing data in their existing metadata fields, where it enhances search results and improves the accessibility of the repository.
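The abstract does not describe the enrichment pipeline in detail, so the following is only a minimal sketch of the general idea of filling missing metadata fields with model suggestions. It is not the authors' method. The field names (`title`, `subject`, `degree`, `advisor`), the helper functions, and the `suggest` callable standing in for an LLM call are all hypothetical and chosen for illustration.

```python
from typing import Callable, Dict, List, Optional

# A metadata record maps field names to values; None or "" means missing.
Record = Dict[str, Optional[str]]
# Stand-in for an LLM call: takes a text excerpt and the fields to fill,
# returns suggested values. Hypothetical interface, not a real library API.
Suggester = Callable[[str, List[str]], Dict[str, str]]


def missing_fields(record: Record, schema: List[str]) -> List[str]:
    """Return the schema fields that are absent or blank in the record."""
    return [f for f in schema if not (record.get(f) or "").strip()]


def enrich(record: Record, excerpt: str, schema: List[str],
           suggest: Suggester) -> Record:
    """Fill only the missing fields; existing values are never overwritten."""
    gaps = missing_fields(record, schema)
    enriched = dict(record)
    if not gaps:
        return enriched
    suggestions = suggest(excerpt, gaps)
    for field in gaps:
        value = (suggestions.get(field) or "").strip()
        if value:  # keep the gap if the model offered nothing usable
            enriched[field] = value
    return enriched


def stub_suggest(excerpt: str, fields: List[str]) -> Dict[str, str]:
    """Canned answers in place of a real model, so the sketch runs offline."""
    canned = {"subject": "Soil chemistry", "degree": "Ph.D."}
    return {f: canned[f] for f in fields if f in canned}


schema = ["title", "subject", "degree", "advisor"]
record = {"title": "A Study", "subject": "", "degree": None, "advisor": None}
enriched = enrich(record, "excerpt of the thesis text", schema, stub_suggest)
```

Restricting the model to fields that are actually missing keeps curated values intact, which leaves room for the manual review step the abstract mentions to focus on the newly suggested values.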

Taxonomy

Topics: Semantic Web and Ontologies