Stem-driven Language Models for Morphologically Rich Languages

Yash Shah; Ishan Tarunesh; Harsh Deshpande; Preethi Jyothi

arXiv:1910.11536 · cs.CL · October 28, 2019

TL;DR

This paper introduces stem-aware neural language models for morphologically rich languages, leveraging unsupervised stem identification and multi-task learning to improve language modeling performance.

Contribution

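The paper builds stem information directly into neural language models: stems are derived with a simple unsupervised technique, and words and stems are combined through multi-task learning and mixture models. Using subword units in the output layer, rather than only as input, has been far less explored.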
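The page does not say which unsupervised stem identification technique the authors use. As an illustration only, the sketch below uses a crude shared-prefix heuristic: a word's stem is taken to be its longest prefix that also begins at least one other vocabulary word. The function names and the `min_len` and `min_support` parameters are hypothetical and are not taken from the paper.

```python
from collections import Counter


def build_prefix_counts(vocab, min_len=3):
    """Count, for every prefix of length >= min_len, how many distinct words start with it."""
    counts = Counter()
    for word in set(vocab):
        for end in range(min_len, len(word) + 1):
            counts[word[:end]] += 1
    return counts


def guess_stem(word, prefix_counts, min_len=3, min_support=2):
    """Return the longest prefix shared by at least min_support words, else the word itself."""
    for end in range(len(word), min_len - 1, -1):
        if prefix_counts[word[:end]] >= min_support:
            return word[:end]
    return word


vocab = ["walked", "walking", "walks", "cat"]
prefix_counts = build_prefix_counts(vocab)
stems = {w: guess_stem(w, prefix_counts) for w in vocab}
print(stems)
```

A heuristic like this over-segments or under-segments many real words, so it should be read as a stand-in for whatever stemmer is actually used, not as a description of it.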

Findings

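01 Significant perplexity reductions over competitive baseline models on Hindi, Tamil, Kannada, and Finnish.

02 Stems are derived with a simple unsupervised stem identification technique.

03 Multi-task learning and mixture models over words and stems improve language modeling for morphologically complex languages.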

Abstract

Neural language models (LMs) have been shown to benefit significantly from enhancing word vectors with subword-level information, especially for morphologically rich languages. This has mainly been tackled by providing subword-level information as an input; using subword units in the output layer has been far less explored. In this work, we propose LMs that are cognizant of the underlying stem of each word. We derive stems for words using a simple unsupervised technique for stem identification. We experiment with different architectures involving multi-task learning and mixture models over words and stems. We focus on four morphologically complex languages (Hindi, Tamil, Kannada and Finnish) and observe significant perplexity gains from our stem-driven LMs compared with other competitive baseline models.
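The abstract mentions mixture models over words and stems without giving the formulation. One plausible form, shown here purely as an assumption, interpolates a word-level distribution with a stem-level path: p(w) = lam * p_word(w) + (1 - lam) * p_stem(stem(w)) * p(w | stem(w)). All names, the toy probabilities, and the fixed weight `lam` are hypothetical; in a neural LM these distributions would come from softmax layers conditioned on the history.

```python
import math


def mixture_prob(word, p_word, p_stem, stem_of, p_word_given_stem, lam=0.5):
    """Interpolate a direct word probability with a stem-mediated word probability."""
    stem = stem_of[word]
    via_stem = p_stem[stem] * p_word_given_stem[word]
    return lam * p_word[word] + (1.0 - lam) * via_stem


def perplexity(probs):
    """Perplexity of a sequence, given the probability assigned to each of its tokens."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))


# Toy next-word distributions for a single context (illustrative numbers only).
stem_of = {"walks": "walk", "walked": "walk", "cat": "cat"}
p_word = {"walks": 0.2, "walked": 0.3, "cat": 0.5}
p_stem = {"walk": 0.6, "cat": 0.4}
# Probability of each word given its stem; sums to 1 within each stem.
p_word_given_stem = {"walks": 0.5, "walked": 0.5, "cat": 1.0}

mixed = {
    w: mixture_prob(w, p_word, p_stem, stem_of, p_word_given_stem)
    for w in stem_of
}
print(mixed)
```

Because p(w | stem) sums to one within each stem, the mixed values still form a valid distribution over the vocabulary, which is what lets perplexity be compared against a word-only baseline.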

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis