Unsilencing Colonial Archives via Automated Entity Recognition
Mrinalini Luthra, Konstantin Todorov, Charles Jeurgens, Giovanni Colavizza

TL;DR
This paper introduces an automated entity recognition approach to broaden access to colonial archives by identifying marginalized individuals who are often omitted from traditional indexes. It is supported by a new annotated corpus and baseline models.
Contribution
It develops a specialized annotation typology, creates a large annotated corpus from the Dutch East India Company archives, and provides baseline neural network models for automated entity recognition.
Findings
Nearly 70,000 annotations released as a shared task
Baseline neural models demonstrate effective entity recognition
Automated recognition can help broaden access to colonial archives
Abstract
Colonial archives are at the center of increased interest from a variety of perspectives, as they contain traces of historically marginalized people. Unfortunately, like most archives, they remain difficult to access due to significant persisting barriers. We focus here on one of them: the biases to be found in historical finding aids, such as indexes of person names, which remain in use to this day. In colonial archives, indexes can perpetuate silences by omitting mentions of historically marginalized persons. In order to overcome such limitations and pluralize the scope of existing finding aids, we propose using automated entity recognition. To this end, we contribute a fit-for-purpose annotation typology and apply it to the colonial archive of the Dutch East India Company (VOC). We release a corpus of nearly 70,000 annotations as a shared task, for which we provide…
Taxonomy
Topics: Data Quality and Management
