Knowing when to stop: insights from ecology for building catalogues, collections, and corpora
Jan Hajič jr., Fabian Moss

TL;DR
This paper applies ecological unseen species estimation methods to music cataloguing, specifically Gregorian chant, revealing coverage bounds and the potential for empirical assessment of repertoire completeness.
Contribution
It introduces the use of the Chao1 estimator in musicology, providing a novel quantitative approach to assess repertoire coverage and uncovering the extent of missing musical sources.
Findings
Upper bounds for repertoire coverage of the major chant genres range between 50% and 80%.
Mass Propers are better covered than Divine Office.
Approximately 5% of chants are newly discovered in recent sources.
Abstract
A major locus of musicological activity, increasingly in the digital domain, is the cataloguing of sources, which requires large-scale and long-lasting research collaborations. Yet, the databases aiming at covering and representing musical repertoires are never quite complete, and scholars must contend with the question: how much are we still missing? This question structurally resembles the 'unseen species' problem in ecology, where the true number of species must be estimated from limited observations. In this case study, we apply for the first time the common Chao1 estimator to music, specifically to Gregorian chant. We find that, overall, upper bounds for repertoire coverage of the major chant genres range between 50% and 80%. As expected, we find that Mass Propers are covered better than the Divine Office, though not overwhelmingly so. However, the accumulation curve suggests that…
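To make the estimator concrete, here is a minimal sketch of Chao1 in its classic form, S_obs + f1^2 / (2 * f2), where f1 and f2 are the numbers of items observed exactly once and exactly twice. Because Chao1 is a lower bound on true richness, S_obs divided by the estimate gives an upper bound on coverage. The function names, the toy data, and the reading of "abundance" as the number of times a chant is observed are assumptions for illustration, not details taken from the paper.

```python
from collections import Counter


def chao1(abundances):
    """Classic Chao1 lower-bound estimate of total richness.

    `abundances` holds one observation count per distinct item
    (hypothetically: how often each chant was seen in the catalogue).
    """
    counts = [n for n in abundances if n > 0]
    s_obs = len(counts)                       # distinct items observed
    f1 = sum(1 for n in counts if n == 1)     # singletons
    f2 = sum(1 for n in counts if n == 2)     # doubletons
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    # Bias-corrected form, used when there are no doubletons.
    return s_obs + f1 * (f1 - 1) / 2


def coverage_upper_bound(abundances):
    """Observed richness over the Chao1 estimate (an upper bound on coverage)."""
    s_obs = sum(1 for n in abundances if n > 0)
    estimate = chao1(abundances)
    return s_obs / estimate if estimate else 0.0


# Toy data (invented): each letter stands for one sighting of a chant.
sightings = ["A", "A", "A", "B", "B", "C", "D", "E", "F", "F"]
abundances = list(Counter(sightings).values())

print(chao1(abundances))                 # estimated total number of chants
print(coverage_upper_bound(abundances))  # estimated share already catalogued
```

In the toy data, six distinct chants are observed, three of them once and two of them twice, so the estimate adds 9/4 unseen chants to the observed six.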
