Leveraging Twitter for Low-Resource Conversational Speech Language Modeling
Aaron Jaech, Mari Ostendorf

TL;DR
This paper presents a language-independent approach to augment low-resource conversational speech language models by harvesting large amounts of Twitter data, significantly reducing perplexity and improving vocabulary coverage.
Contribution
It introduces a simple method for collecting Twitter data to enhance low-resource language models and a technique to prioritize data collection using social and textual cues.
Findings
Significant perplexity reduction on four low-resource languages
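Twitter text is more useful than in-domain text for learning word classes, and using these classes reduces perplexity further
Prioritizing the download queue with social and textual cues increases the amount of useful data collected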
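To make the perplexity finding concrete, here is a minimal sketch of how out-of-domain text can lower held-out perplexity when mixed with a small in-domain model. It is an illustration under stated assumptions, not the paper's procedure: this summary does not say how the two data sources are combined, so linear interpolation of add-alpha unigram models is assumed here, and the toy corpora, the function names, and the weight `lam` are all hypothetical.

```python
import math
from collections import Counter

def unigram_model(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def interpolate(p_in, p_out, lam):
    """Linear interpolation: lam * in-domain + (1 - lam) * out-of-domain."""
    return {w: lam * p_in[w] + (1 - lam) * p_out[w] for w in p_in}

def perplexity(model, tokens):
    """exp of the average negative log-probability per token."""
    log_prob = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_prob / len(tokens))

# Toy stand-ins (hypothetical): a tiny in-domain set and a larger "Twitter" set.
in_domain = "yes okay yes right okay".split()
twitter = "lol okay right haha yes thanks haha lol thanks okay".split()
held_out = "okay thanks yes haha".split()
vocab = sorted(set(in_domain) | set(twitter) | set(held_out))

p_in = unigram_model(in_domain, vocab)
p_tw = unigram_model(twitter, vocab)
p_mix = interpolate(p_in, p_tw, lam=0.5)  # lam is an assumed, untuned weight

ppl_in = perplexity(p_in, held_out)
ppl_mix = perplexity(p_mix, held_out)
print(f"in-domain only: {ppl_in:.2f}  interpolated: {ppl_mix:.2f}")
```

On this toy data the interpolated model assigns more probability to held-out words that the in-domain set never saw, which is the coverage effect the findings describe. In practice the weight would be tuned on held-out data and the models would be higher-order n-grams.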
Abstract
In applications involving conversational speech, data sparsity is a limiting factor in building a better language model. We propose a simple, language-independent method to quickly harvest large amounts of data from Twitter to supplement a smaller training set that is more closely matched to the domain. The techniques lead to a significant reduction in perplexity on four low-resource languages even though the presence on Twitter of these languages is relatively small. We also find that the Twitter text is more useful for learning word classes than the in-domain text and that use of these word classes leads to further reductions in perplexity. Additionally, we introduce a method of using social and textual information to prioritize the download queue during the Twitter crawling. This maximizes the amount of useful data that can be collected, impacting both perplexity and vocabulary…
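The prioritized download queue can be pictured as a best-first traversal of the user graph. The sketch below assumes a hypothetical score that averages a textual cue (share of a user's sampled tweets in the target language) and a social cue (share of the user's connections already known to use that language). The abstract does not give the scoring function, so the names `textual`, `social`, `score`, and `crawl`, the equal weighting, and the in-memory graph are illustrative only; no real Twitter API is called.

```python
import heapq

def crawl(seed, graph, score, budget):
    """Visit up to `budget` users, always expanding the highest-scoring one."""
    heap = [(-score(seed), seed)]  # negate scores: heapq is a min-heap
    seen = {seed}
    order = []
    while heap and len(order) < budget:
        _, user = heapq.heappop(heap)
        order.append(user)  # a real crawler would download this user's tweets here
        for neighbor in graph.get(user, []):
            if neighbor not in seen:
                seen.add(neighbor)
                heapq.heappush(heap, (-score(neighbor), neighbor))
    return order

# Hypothetical toy data: who links to whom, plus per-user cue values in [0, 1].
graph = {"a": ["b", "c", "d"], "b": ["e"], "c": [], "d": ["f"]}
textual = {"a": 0.9, "b": 0.2, "c": 0.8, "d": 0.5, "e": 0.9, "f": 0.7}
social = {"a": 1.0, "b": 0.1, "c": 0.6, "d": 0.5, "e": 0.3, "f": 0.9}

def score(user):
    """Assumed equal-weight combination of the textual and social cues."""
    return 0.5 * textual[user] + 0.5 * social[user]

order = crawl("a", graph, score, budget=4)
print(order)
```

With a fixed download budget, the low-scoring user `b` is never fetched, while `f` is reached through `d`. That is the intended effect: the budget goes to accounts most likely to yield target-language text.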
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
