The Harrington Yowlumne Narrative Corpus
Nathan M. White, Timothy Henry-Rodriguez

TL;DR
This paper introduces the Harrington Yowlumne Narrative Corpus, a resource of 20 transcribed and linguistically annotated narratives from the Yowlumne community, aimed at supporting minority language NLP development.
Contribution
It provides a digitally transcribed, normalized, lemmatized, and POS-tagged corpus of minority language narratives, facilitating NLP research and community access to historical texts.
Findings
Corpus contains 57,136 characters and 10,719 words.
A Levenshtein distance-based algorithm combined with manual checking produced gold-standard aligned, normalized, and lemmatized text.
Resource supports minority language NLP development.
Abstract
Minority languages continue to lack adequate resources for their development, especially in the technological domain. Likewise, the J.P. Harrington Papers collection at the Smithsonian Institution is difficult to access in practical terms for community members and researchers due to its handwritten and disorganized format. Our current work seeks to make a portion of this publicly available yet problematic material practically accessible for natural language processing use. Here, we present the Harrington Yowlumne Narrative Corpus, a corpus of 20 narrative texts that derive from the Tejoneño Yowlumne community of the Tinliw rancheria in Kern County, California between 1910 and 1925. We digitally transcribe the texts and, through a Levenshtein distance-based algorithm and manual checking, we provide gold-standard aligned normalized and lemmatized text. We likewise provide POS tags for…
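The abstract mentions a Levenshtein distance-based algorithm but does not describe it in detail. As a rough illustration only, the sketch below computes the standard Levenshtein edit distance and uses it to pick the closest normalized form for a transcribed token. The function names `levenshtein` and `closest_form`, the tie-breaking rule, and the toy English strings are all assumptions made for this example; they are not taken from the paper or the corpus.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def closest_form(token: str, candidates: list[str]) -> str:
    """Hypothetical helper: choose the candidate normalized form with the
    smallest edit distance to `token`, breaking ties alphabetically."""
    return min(candidates, key=lambda c: (levenshtein(token, c), c))


if __name__ == "__main__":
    # Toy English strings, not corpus data.
    print(levenshtein("kitten", "sitting"))
    print(closest_form("colour", ["color", "collar", "cooler"]))
```

In a pipeline like the one the abstract outlines, automatic matches of this kind would presumably serve as suggestions that are then confirmed or corrected during manual checking. Note that `closest_form` raises `ValueError` when given an empty candidate list.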
Taxonomy
Topics: Natural Language Processing Techniques · Digital Humanities and Scholarship
