Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages
C.M. Downey, Shannon Drizin, Levon Haroutunian, Shivin Thukral

TL;DR
This paper shows that unsupervised sequence segmentation transfers to extremely low-resource languages when a Masked Segmental Language Model is pre-trained multilingually, with the largest benefits at small target dataset sizes and in the zero-shot setting.
Contribution
It introduces a method for transferring unsupervised segmentation capabilities to low-resource languages via multilingual pre-training on typologically similar languages.
Findings
Multilingual pre-training improves segmentation over the monolingual (from-scratch) baseline in 6 of 10 experimental settings.
The approach achieves a zero-shot F1 score of 20.6 on the target language, K'iche'.
Segmentation quality remains consistent across target dataset sizes, including small ones.
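For readers unfamiliar with how a segmentation F1 score is formed, the following is a minimal sketch of one common convention: a predicted segment counts as correct only if both of its boundaries match a gold segment. The summary above does not state which F1 variant the paper uses, so this convention is an assumption, as are the function names `segment_spans` and `segmentation_f1` and the English toy example. The function returns a value in [0, 1]; scores such as 20.6 are conventionally reported on a 0 to 100 scale.

```python
def segment_spans(segments):
    """Convert a list of segments into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for seg in segments:
        end = start + len(seg)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f1(gold, predicted):
    """Span-level F1 between a gold and a predicted segmentation.

    Assumes both segmentations cover the same underlying character
    sequence. This is an illustrative convention, not necessarily the
    metric used in the paper.
    """
    gold_spans = segment_spans(gold)
    pred_spans = segment_spans(predicted)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy example: one of two predicted segments matches the gold.
gold = ["un", "break", "able"]
predicted = ["un", "breakable"]
score = segmentation_f1(gold, predicted)  # precision 1/2, recall 1/3
```

In this toy case only the span for "un" matches, so the score is the harmonic mean of 1/2 and 1/3.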
Abstract
We show that unsupervised sequence-segmentation performance can be transferred to extremely low-resource languages by pre-training a Masked Segmental Language Model (Downey et al., 2021) multilingually. Further, we show that this transfer can be achieved by training over a collection of low-resource languages that are typologically similar (but phylogenetically unrelated) to the target language. In our experiments, we transfer from a collection of 10 Indigenous American languages (AmericasNLP, Mager et al., 2021) to K'iche', a Mayan language. We compare our multilingual model to a monolingual (from-scratch) baseline, as well as a model pre-trained on Quechua only. We show that the multilingual pre-trained approach yields consistent segmentation quality across target dataset sizes, exceeding the monolingual baseline in 6/10 experimental settings. Our model yields especially strong…
Taxonomy
Topics: Natural Language Processing Techniques · Genomics and Phylogenetic Studies · Language and cultural evolution
