Eeny, meeny, miny, moe. How to choose data for morphological inflection
Saliha Muradoglu, Mans Hulden

TL;DR
This paper investigates active learning strategies for selecting data to improve morphological inflection in low-resource languages, demonstrating that confidence and entropy-based sampling enhance model performance, with oracle-based selection showing the most benefit.
Contribution
It compares four sampling strategies for morphological inflection across diverse languages, highlighting the effectiveness of confidence and entropy-based methods and analyzing their robustness.
Findings
Confidence and entropy-based sampling improve inflection accuracy.
Oracle-based selection yields the highest improvements.
Adding high-confidence or low-entropy forms can sometimes reduce performance.
Abstract
Data scarcity is a widespread problem in numerous natural language processing (NLP) tasks for low-resource languages. Within morphology, the labour-intensive work of tagging/glossing data is a serious bottleneck for both NLP and language documentation. Active learning (AL) aims to reduce the cost of data annotation by selecting data that is most informative for improving the model. In this paper, we explore four sampling strategies for the task of morphological inflection using a Transformer model: a pair of oracle experiments where data is chosen based on whether the model already can or cannot inflect the test forms correctly, as well as strategies based on high/low model confidence, entropy, as well as random selection. We investigate the robustness of each strategy across 30 typologically diverse languages. We also perform a more in-depth case study of Natügu. Our results show a…
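The sampling strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The excerpt does not say how confidence or entropy is computed, so the sketch assumes confidence is the log-probability of the greedily decoded form and entropy is the mean per-character Shannon entropy. The names `sequence_confidence`, `mean_entropy`, `select`, and `oracle_select`, and the toy data format (one probability distribution per output character), are hypothetical.

```python
import math
import random


def sequence_confidence(step_probs):
    """Assumed confidence: sum of log-probabilities of the most likely
    character at each output step (greedy decoding)."""
    return sum(math.log(max(dist)) for dist in step_probs)


def mean_entropy(step_probs):
    """Assumed uncertainty: Shannon entropy (in nats) averaged over
    the per-character output distributions."""
    total = 0.0
    for dist in step_probs:
        total -= sum(p * math.log(p) for p in dist if p > 0)
    return total / len(step_probs)


def select(pool, k, strategy, seed=0):
    """Pick k item ids from pool (id -> list of per-step distributions)."""
    if strategy == "random":
        return random.Random(seed).sample(sorted(pool), k)
    scorers = {
        "low_confidence": (sequence_confidence, False),
        "high_confidence": (sequence_confidence, True),
        "low_entropy": (mean_entropy, False),
        "high_entropy": (mean_entropy, True),
    }
    score, descending = scorers[strategy]
    ranked = sorted(pool, key=lambda i: score(pool[i]), reverse=descending)
    return ranked[:k]


def oracle_select(predictions, gold, k, want_incorrect=True):
    """Oracle selection: needs gold forms. Returns up to k ids the model
    currently gets wrong (or right, if want_incorrect is False)."""
    hits = [i for i in sorted(gold)
            if (predictions[i] != gold[i]) == want_incorrect]
    return hits[:k]


# Toy pool: three candidate forms, two output steps each.
pool = {
    "a": [[0.9, 0.1], [0.8, 0.2]],
    "b": [[0.5, 0.5], [0.6, 0.4]],
    "c": [[0.99, 0.01], [0.97, 0.03]],
}
print(select(pool, 1, "low_confidence"))
print(select(pool, 1, "low_entropy"))
```

In this toy pool the least confident and highest-entropy item coincide, which need not hold in general: confidence here looks only at the top character per step, while entropy reflects the whole distribution.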
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling
Methods: Multi-Head Attention · Test · Adam · Softmax · Position-Wise Feed-Forward Layer · Linear Layer · Label Smoothing · Dense Connections · Attention Is All You Need · Absolute Position Encodings
