Learning Beyond Limits: Multitask Learning and Synthetic Data for Low-Resource Canonical Morpheme Segmentation

Changbing Yang; Garrett Nicolai

arXiv:2505.16800 · cs.CL · May 23, 2025

Open Access

TL;DR

This paper presents a transformer-based morpheme segmentation system that combines multitask learning with synthetic data generated by large language models to improve performance in low-resource languages.

Contribution

The paper introduces a framework that jointly predicts morphological segments and glosses from orthographic input, and supplements the scarce training data with LLM-generated synthetic examples to improve low-resource morpheme segmentation.

Findings

01. Significant improvement in word-level segmentation accuracy and morpheme-level F1-score on the SIGMORPHON 2023 dataset

02. Enhanced model generalization across multiple low-resource languages

03. Effective use of LLM-generated synthetic data for low-resource NLP tasks

Abstract

We introduce a transformer-based morpheme segmentation system that augments a low-resource training signal through multitask learning and LLM-generated synthetic data. Our framework jointly predicts morphological segments and glosses from orthographic input, leveraging shared linguistic representations obtained through a common documentary process to enhance model generalization. To further address data scarcity, we integrate synthetic training data generated by large language models (LLMs) using in-context learning. Experimental results on the SIGMORPHON 2023 dataset show that our approach significantly improves word-level segmentation accuracy and morpheme-level F1-score across multiple low-resource languages.
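
The two evaluation measures named in the abstract can be illustrated with a minimal sketch. It assumes that segmentations are written as hyphen-delimited strings, that word-level accuracy means an exact match of the whole segmented word, and that morpheme-level F1 is computed from multiset overlap between predicted and gold morphemes. The function names `word_accuracy` and `morpheme_f1` and the toy data are hypothetical; the scoring used in the paper may differ in its details.

```python
from collections import Counter


def word_accuracy(gold, pred):
    """Fraction of words whose predicted segmentation matches gold exactly."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def morpheme_f1(gold, pred, sep="-"):
    """Micro-averaged F1 over morphemes (assumption: multiset overlap per word)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    true_pos = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        g_counts = Counter(g.split(sep))
        p_counts = Counter(p.split(sep))
        # Counter intersection keeps the minimum count of each shared morpheme.
        true_pos += sum((g_counts & p_counts).values())
        n_gold += sum(g_counts.values())
        n_pred += sum(p_counts.values())
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


# Toy example (invented data, not from the paper).
gold = ["un-happy-ness", "walk-ed", "cat-s"]
pred = ["un-happy-ness", "walk-ed", "cats"]
print(f"word accuracy: {word_accuracy(gold, pred):.3f}")
print(f"morpheme F1:   {morpheme_f1(gold, pred):.3f}")
```

In this toy case the third word is left unsegmented, so it counts as a full miss for word-level accuracy while the other two words still contribute all of their morphemes to F1. This is why the two measures are usually reported together.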

Taxonomy

Topics: Natural Language Processing Techniques · Language and cultural evolution · Topic Modeling