An Extreme Multi-label Text Classification (XMTC) Library Dataset: What if we took "Use of Practical AI in Digital Libraries" seriously?
Jennifer D'Souza, Sameer Sadruddin, Maximilian Kähler, Andrea Salfinger, Luca Zaccagna, Francesca Incitti, Lauro Snidaro, Osma Suominen

TL;DR
This paper introduces a large bilingual dataset for multi-label text classification in digital libraries, enabling ontology-aware classification and supporting the development of AI tools that assist catalogers with authority-grounded evaluation.
Contribution
The paper provides a new bilingual (English/German) corpus of catalog records annotated with the Integrated Authority File (GND), together with a machine-actionable GND taxonomy, to support research on ontology-aware classification and AI-assisted cataloging.
Findings
Statistical profile of the dataset provided
Qualitative error analysis of three classification systems conducted
Community invited to evaluate accuracy, usefulness, and transparency
Abstract
Subject indexing is vital for discovery but hard to sustain at scale and across languages. We release a large bilingual (English/German) corpus of catalog records annotated with the Integrated Authority File (GND), plus a machine-actionable GND taxonomy. The resource enables ontology-aware multi-label classification, mapping text to authority terms, and agent-assisted cataloging with reproducible, authority-grounded evaluation. We provide a brief statistical profile and qualitative error analyses of three systems. We invite the community to assess not only accuracy but usefulness and transparency, toward authority-anchored AI co-pilots that amplify catalogers' work.
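As a rough illustration of what authority-grounded evaluation could look like in practice, the sketch below scores ranked subject-term predictions against the gold authority identifiers attached to each record, using macro-averaged precision@k and recall@k. This is a minimal sketch under stated assumptions, not the evaluation protocol used in the paper: the function name, the dictionary layout, and the identifier strings are all hypothetical placeholders, and each ranked list is assumed to contain no duplicate identifiers.

```python
from typing import Dict, List, Set


def precision_recall_at_k(
    gold: Dict[str, Set[str]],
    ranked: Dict[str, List[str]],
    k: int,
) -> Dict[str, float]:
    """Macro-averaged precision@k and recall@k over catalog records.

    gold   maps a record id to its set of gold authority identifiers.
    ranked maps a record id to predicted identifiers, best first.
    Both layouts are assumptions made for this illustration.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not gold:
        return {"precision": 0.0, "recall": 0.0}

    precisions: List[float] = []
    recalls: List[float] = []
    for record_id, gold_ids in gold.items():
        # Records with no prediction contribute zero hits.
        top_k = ranked.get(record_id, [])[:k]
        hits = sum(1 for term_id in top_k if term_id in gold_ids)
        precisions.append(hits / k)
        recalls.append(hits / len(gold_ids) if gold_ids else 0.0)

    n = len(gold)
    return {"precision": sum(precisions) / n, "recall": sum(recalls) / n}


# Placeholder identifiers, not real GND entries.
gold_labels = {"rec1": {"gnd:A", "gnd:B"}, "rec2": {"gnd:C"}}
predictions = {
    "rec1": ["gnd:A", "gnd:X", "gnd:B"],
    "rec2": ["gnd:Y", "gnd:C"],
}
scores = precision_recall_at_k(gold_labels, predictions, k=2)
```

Because predictions and gold labels share the same identifier space, a score of this kind can be reproduced by anyone holding the same records and the same vocabulary snapshot. Measuring usefulness or transparency would need criteria beyond set overlap.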
Taxonomy
Topics: Text and Document Classification Technologies · Library Science and Information Systems · Information Retrieval and Search Behavior
