A Digital Corpus of St. Lawrence Island Yupik
Lane Schwartz, Emily Chen, Hyunji Hayley Park, Edward Jahn, and Sylvia L.R. Schreiner

TL;DR
This paper introduces the first publicly available digital corpus of St. Lawrence Island Yupik, an endangered language. The corpus was built with a step-by-step digitization pipeline and is intended to support linguistic research, language revitalization, and language technology development.
Contribution
The paper presents a step-by-step pipeline for digitizing written texts and the first publicly available digital corpus of St. Lawrence Island Yupik, supporting linguistic research and language technology development.
Findings
Created the first publicly available digital corpus of St. Lawrence Island Yupik
Identified the corpus's potential for NLP research and for language revitalization tools
Made Yupik texts easier to access for educators and members of the Yupik community
Abstract
St. Lawrence Island Yupik (ISO 639-3: ess) is an endangered polysynthetic language in the Inuit-Yupik language family indigenous to Alaska and Chukotka. This work presents a step-by-step pipeline for the digitization of written texts, and the first publicly available digital corpus for St. Lawrence Island Yupik, created using that pipeline. This corpus has great potential for future linguistic inquiry and research in NLP. It was also developed for use in Yupik language education and revitalization, with a primary goal of enabling easy access to Yupik texts by educators and by members of the Yupik community. A secondary goal is to support development of language technology such as spell-checkers, text-completion systems, interactive e-books, and language learning apps for use by the Yupik community.
