Leveraging Twitter for Low-Resource Conversational Speech Language Modeling

Aaron Jaech; Mari Ostendorf

arXiv:1504.02490 · cs.CL · April 13, 2015 · 1 cites

Open Access

TL;DR

This paper presents a language-independent approach to augment low-resource conversational speech language models by harvesting large amounts of Twitter data, significantly reducing perplexity and improving vocabulary coverage.

Contribution

The paper introduces a simple method for collecting Twitter data to enhance low-resource language models, along with a technique that uses social and textual cues to prioritize data collection.

Findings

01 Significant perplexity reduction on four low-resource languages

02 Twitter data improves word class learning

03 Prioritized crawling increases useful data collection

Abstract

In applications involving conversational speech, data sparsity is a limiting factor in building a better language model. We propose a simple, language-independent method to quickly harvest large amounts of data from Twitter to supplement a smaller training set that is more closely matched to the domain. The techniques lead to a significant reduction in perplexity on four low-resource languages even though the presence on Twitter of these languages is relatively small. We also find that the Twitter text is more useful for learning word classes than the in-domain text and that use of these word classes leads to further reductions in perplexity. Additionally, we introduce a method of using social and textual information to prioritize the download queue during the Twitter crawling. This maximizes the amount of useful data that can be collected, impacting both perplexity and vocabulary…
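The abstract says the download queue is prioritized using social and textual information, but it does not say how those cues are scored or combined. As a rough illustration only, the idea can be sketched in Python as a priority queue keyed on a blended score. Every name here (`score_user`, `DownloadQueue`, the two fraction inputs, and `social_weight`) is a hypothetical stand-in, not the authors' method.

```python
import heapq


def score_user(social_cue, textual_cue, social_weight=0.5):
    """Blend two cues in [0, 1] into one priority score.

    Hypothetical scoring: social_cue could be the share of a user's
    contacts already known to write in the target language, and
    textual_cue the share of sampled tweets identified as that language.
    The paper's actual features and weighting are not given in the abstract.
    """
    return social_weight * social_cue + (1.0 - social_weight) * textual_cue


class DownloadQueue:
    """Max-priority queue of user IDs awaiting download (illustrative)."""

    def __init__(self):
        self._heap = []

    def push(self, user_id, score):
        # heapq is a min-heap, so store the negated score to pop the
        # highest-scoring user first.
        heapq.heappush(self._heap, (-score, user_id))

    def pop(self):
        _neg_score, user_id = heapq.heappop(self._heap)
        return user_id

    def __len__(self):
        return len(self._heap)


# Usage: users that look more likely to yield target-language text
# are downloaded first.
queue = DownloadQueue()
queue.push("user_a", score_user(0.9, 0.8))
queue.push("user_b", score_user(0.1, 0.2))
queue.push("user_c", score_user(0.5, 0.6))

crawl_order = [queue.pop() for _ in range(len(queue))]
```

With a fixed download budget, popping from such a queue spends requests on the accounts most likely to contain useful text, which is the effect the abstract attributes to prioritized crawling.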

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems