Unsilencing Colonial Archives via Automated Entity Recognition

Mrinalini Luthra; Konstantin Todorov; Charles Jeurgens; Giovanni Colavizza

arXiv:2210.02194·cs.DL·October 6, 2022·1 cites

TL;DR

This paper introduces an automated entity recognition approach to enhance access to colonial archives by identifying marginalized individuals often omitted in traditional indexes, supported by a new annotated corpus and baseline models.

Contribution

It develops a specialized annotation typology, creates a large annotated corpus from the Dutch East India Company archives, and provides baseline neural network models for automated entity recognition.

Findings

1. Nearly 70,000 annotations released as a shared task
2. Baseline neural models demonstrate effective entity recognition
3. Automated recognition can help broaden access to colonial archives

Abstract

Colonial archives are at the center of increased interest from a variety of perspectives, as they contain traces of historically marginalized people. Unfortunately, like most archives, they remain difficult to access due to significant persisting barriers. We focus here on one of them: the biases to be found in historical finding aids, such as indexes of person names, which remain in use to this day. In colonial archives, indexes can perpetuate silences by omitting to include mentions of historically marginalized persons. In order to overcome such limitations and pluralize the scope of existing finding aids, we propose using automated entity recognition. To this end, we contribute a fit-for-purpose annotation typology and apply it to the colonial archive of the Dutch East India Company (VOC). We release a corpus of nearly 70,000 annotations as a shared task, for which we provide…

Code & Models

Repositories

budh333/unsilence_voc
PyTorch · Official

Datasets

biglam/unsilence_voc
dataset · 9 dl

Taxonomy

Topics: Data Quality and Management