Stem-driven Language Models for Morphologically Rich Languages
Yash Shah, Ishan Tarunesh, Harsh Deshpande, Preethi Jyothi

TL;DR
This paper introduces stem-aware neural language models for morphologically rich languages, leveraging unsupervised stem identification and multi-task learning to improve language modeling performance.
Contribution
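The paper incorporates stem information directly into neural language models, combining unsupervised stem identification with multi-task and mixture-model architectures over words and stems. It targets the use of subword units in the output layer, which the abstract describes as far less explored than subword-level inputs.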
Findings
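Significant perplexity reductions on Hindi, Tamil, Kannada, and Finnish compared with competitive baseline models.
Stems derived by a simple unsupervised identification technique are sufficient to drive these gains.
Both multi-task and mixture-model architectures over words and stems are explored.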
Abstract
Neural language models (LMs) have been shown to benefit significantly from enhancing word vectors with subword-level information, especially for morphologically rich languages. This has mainly been tackled by providing subword-level information as an input; using subword units in the output layer has been far less explored. In this work, we propose LMs that are cognizant of the underlying stems in each word. We derive stems for words using a simple unsupervised technique for stem identification. We experiment with different architectures involving multi-task learning and mixture models over words and stems. We focus on four morphologically complex languages -- Hindi, Tamil, Kannada and Finnish -- and observe significant perplexity gains from our stem-driven LMs when compared with other competitive baseline models.
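As a rough illustration of the multi-task idea mentioned in the abstract, the sketch below combines a word-prediction loss with an auxiliary stem-prediction loss for a single position. It is not the authors' implementation: the function names, the `stem_weight` parameter, and the choice of a simple weighted sum are all assumptions made for illustration, and the abstract does not specify how the two objectives are actually combined.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def multitask_loss(word_logits, stem_logits, word_id, stem_id, stem_weight=0.5):
    """Hypothetical multi-task objective for one position.

    Returns (total_loss, word_nll). The total adds a weighted stem
    cross-entropy to the word cross-entropy; stem_weight is an
    assumed hyperparameter, not a value taken from the paper.
    """
    word_nll = -log_softmax(word_logits)[word_id]
    stem_nll = -log_softmax(stem_logits)[stem_id]
    return word_nll + stem_weight * stem_nll, word_nll

def perplexity(word_nlls):
    """Perplexity from per-word negative log-likelihoods (natural log)."""
    return math.exp(sum(word_nlls) / len(word_nlls))

# Toy usage: a 4-word vocabulary and a 2-stem vocabulary, uniform scores.
total, word_nll = multitask_loss([0.0] * 4, [0.0] * 2, word_id=1, stem_id=0)
ppl = perplexity([word_nll, word_nll])
```

In this sketch perplexity is computed from the word-level loss only, so the auxiliary stem term shapes training without changing the evaluation metric.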
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
