Learning Beyond Limits: Multitask Learning and Synthetic Data for Low-Resource Canonical Morpheme Segmentation
Changbing Yang, Garrett Nicolai

TL;DR
This paper presents a transformer-based morpheme segmentation system that combines multitask learning with synthetic data generated by large language models to improve performance in low-resource languages.
Contribution
It introduces a novel framework that jointly predicts segments and glosses, leveraging synthetic data and multitask learning to enhance low-resource morpheme segmentation.
Findings
Significant improvement in word-level segmentation accuracy and morpheme-level F1-score on the SIGMORPHON 2023 dataset
Enhanced model generalization across multiple low-resource languages
Effective use of LLM-generated synthetic data for low-resource NLP tasks
Abstract
We introduce a transformer-based morpheme segmentation system that augments a low-resource training signal through multitask learning and LLM-generated synthetic data. Our framework jointly predicts morphological segments and glosses from orthographic input, leveraging shared linguistic representations obtained through a common documentary process to enhance model generalization. To further address data scarcity, we integrate synthetic training data generated by large language models (LLMs) using in-context learning. Experimental results on the SIGMORPHON 2023 dataset show that our approach significantly improves word-level segmentation accuracy and morpheme-level F1-score across multiple low-resource languages.
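The two metrics named in the abstract can be made concrete with a small sketch. This is an illustration under stated assumptions, not the shared task's official scorer or the authors' evaluation code: it assumes segmentations are written as hyphen-separated strings, that word-level accuracy means exact match of the whole segmented word, and that morpheme-level F1 is micro-averaged over the multiset overlap of predicted and gold morphemes. The function names and the English toy words are hypothetical.

```python
from collections import Counter


def word_accuracy(gold, pred):
    """Fraction of words whose predicted segmentation matches gold exactly."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)


def morpheme_f1(gold, pred, sep="-"):
    """Micro-averaged F1 over morphemes (assumed definition, see lead-in).

    For each word, the overlap is the multiset intersection of the gold
    and predicted morphemes; counts are summed over all words before
    precision and recall are computed.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    overlap = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        g_counts = Counter(g.split(sep))
        p_counts = Counter(p.split(sep))
        overlap += sum((g_counts & p_counts).values())
        n_gold += sum(g_counts.values())
        n_pred += sum(p_counts.values())
    if overlap == 0:
        return 0.0
    precision = overlap / n_pred
    recall = overlap / n_gold
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical, not from the SIGMORPHON 2023 dataset).
# Canonical segmentation restores underlying forms, e.g. "happy"
# rather than the surface string "happi".
gold = ["un-happy-ness", "walk-ed", "cat-s"]
pred = ["un-happi-ness", "walk-ed", "cats"]

acc = word_accuracy(gold, pred)
f1 = morpheme_f1(gold, pred)
```

On this toy data only one of three words is an exact match, while four morphemes overlap out of seven gold and six predicted, so the morpheme-level score gives partial credit that word-level accuracy does not.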
Taxonomy
Topics: Natural Language Processing Techniques · Language and cultural evolution · Topic Modeling
