Low-Resource, High-Impact: Building Corpora for Inclusive Language Technologies
Ekaterina Artemova, Laurie Burchell, Daryna Dementieva, Shu Okabe, Mariya Shmatova, Pedro Ortiz Suarez

TL;DR
This tutorial shows NLP practitioners how to build inclusive language technologies for low-resource languages, covering data collection, modeling, and community-informed development as ways to address data scarcity and cultural variance.
Contribution
It offers a practical toolkit and strategies for building end-to-end NLP pipelines for underrepresented languages, with a focus on fair, reproducible, and community-informed development grounded in real-world scenarios.
Findings
Use cases showcased across more than 10 languages
Effective methods for data collection and parallel sentence mining
Frameworks for fair and community-informed NLP development
Abstract
This tutorial (https://tum-nlp.github.io/low-resource-tutorial) is designed for NLP practitioners, researchers, and developers working with multilingual and low-resource languages who seek to create more equitable and socially impactful language technologies. Participants will walk away with a practical toolkit for building end-to-end NLP pipelines for underrepresented languages -- from data collection and web crawling to parallel sentence mining, machine translation, and downstream applications such as text classification and multimodal reasoning. The tutorial presents strategies for tackling the challenges of data scarcity and cultural variance, offering hands-on methods and modeling frameworks. We will focus on fair, reproducible, and community-informed development approaches, grounded in real-world scenarios. We will showcase a diverse set of use cases covering over 10 languages…
Taxonomy
Topics: ICT in Developing Communities · Natural Language Processing Techniques · Language and cultural evolution
