The Open corpus of the Veps and Karelian languages: overview and applications
Tatyana Boyko, Nina Zaitseva, Natalia Krizhanovskaya, Andrew Krizhanovsky, Irina Novak, Nataliya Pellinen, Aleksandra Rodionova

TL;DR
This paper presents the development and features of the VepKar corpus, a digital resource for the Veps and Karelian languages that supports linguistic research and language preservation through advanced search and annotation tools.
Contribution
It introduces the VepKar corpus with its multifunctional tools, extensive text collection, and plans for speech and syntactic modules, advancing corpus linguistics for these languages.
Findings
Corpus includes 3000 texts in Veps and Karelian.
Implemented advanced search and classification features.
Plans for speech and syntactic analysis modules.
Abstract
Methods and tools of corpus linguistics have become a growing priority in the study of the Baltic-Finnic languages of the Republic of Karelia. Since 2016, linguists, mathematicians, and programmers at the Karelian Research Centre have been working with the Open Corpus of the Veps and Karelian Languages (VepKar), an extension of the Veps Corpus created in 2009. The VepKar corpus comprises texts in Karelian and Veps, multifunctional dictionaries linked to them, and software with an advanced search system that filters by text criteria (language, genre, etc.) and by numerous linguistic categories (lexical and grammatical search in texts was implemented thanks to the word-form generator that we created earlier). A corpus of 3000 texts was compiled, texts were uploaded and marked up, the system for classifying texts into languages, dialects, types and genres was introduced,…
