An Extreme Multi-label Text Classification (XMTC) Library Dataset: What if we took "Use of Practical AI in Digital Libraries" seriously?

Jennifer D'Souza; Sameer Sadruddin; Maximilian Kähler; Andrea Salfinger; Luca Zaccagna; Francesca Incitti; Lauro Snidaro; Osma Suominen

arXiv:2603.10876·cs.CL·March 12, 2026

Open Access

TL;DR

This paper introduces a large bilingual (English/German) dataset of GND-annotated catalog records for extreme multi-label text classification in digital libraries. The dataset enables ontology-aware classification and supports AI tools that assist catalogers, with evaluation grounded in the authority file.

Contribution

The paper provides a new bilingual (English/German) corpus of catalog records annotated with the Integrated Authority File (GND), together with a machine-actionable GND taxonomy, to support research on ontology-aware classification and AI-assisted cataloging.

Findings

01. The paper provides a brief statistical profile of the dataset.

02. It reports qualitative error analyses of three classification systems.

03. It invites the community to evaluate not only accuracy but also usefulness and transparency.

Abstract

Subject indexing is vital for discovery but hard to sustain at scale and across languages. We release a large bilingual (English/German) corpus of catalog records annotated with the Integrated Authority File (GND), plus a machine-actionable GND taxonomy. The resource enables ontology-aware multi-label classification, mapping text to authority terms, and agent-assisted cataloging with reproducible, authority-grounded evaluation. We provide a brief statistical profile and qualitative error analyses of three systems. We invite the community to assess not only accuracy but usefulness and transparency, toward authority-anchored AI co-pilots that amplify catalogers' work.
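
To make "authority-grounded evaluation" concrete, here is a minimal sketch of micro-averaged precision, recall, and F1 computed over sets of authority identifiers per record. It is not the authors' evaluation protocol: the function name `micro_prf`, the record IDs, and the `gnd:` identifiers are hypothetical placeholders chosen for illustration.

```python
from typing import Dict, Set, Tuple


def micro_prf(
    gold: Dict[str, Set[str]],
    pred: Dict[str, Set[str]],
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over label sets.

    Only records present in `gold` are scored; a record with no
    prediction is treated as having an empty predicted set.
    """
    tp = fp = fn = 0
    for record_id, gold_labels in gold.items():
        pred_labels = pred.get(record_id, set())
        tp += len(gold_labels & pred_labels)   # correctly assigned terms
        fp += len(pred_labels - gold_labels)   # spurious terms
        fn += len(gold_labels - pred_labels)   # missed terms
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


# Hypothetical records and identifiers (not real GND IDs).
gold = {"rec1": {"gnd:A", "gnd:B"}, "rec2": {"gnd:C"}}
pred = {"rec1": {"gnd:A", "gnd:D"}, "rec2": {"gnd:C"}}

precision, recall, f1 = micro_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Comparing identifiers rather than label strings keeps the score independent of the language of the record, which is one reason an authority file is a convenient anchor for a bilingual corpus. An exact-match metric like this one ignores the taxonomy; a hierarchy-aware variant would need the released GND taxonomy.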

Taxonomy

Topics: Text and Document Classification Technologies · Library Science and Information Systems · Information Retrieval and Search Behavior