Conserving Fuel in Statistical Language Learning: Predicting Data Requirements

Mark Lauer (Microsoft Institute, Sydney)

arXiv:cmp-lg/9509002 · cmp-lg · February 3, 2008 · 5 cites

TL;DR

This paper develops methods to predict the amount of training data needed for statistical language learning systems, combining theoretical bounds and simulations to better understand data requirements.

Contribution

It introduces a more accurate estimate of expected accuracy for mode-based learners under uniformly distributed inputs, and explores through simulation how non-uniform input distributions affect training data needs.

Findings

01 Derived bounds on the expected accuracy of a mode-based learner as a function of training data volume

02 Proposed a close, cheaply computable approximation to the more accurate (but expensive) accuracy estimate

03 Ran simulations exploring the effect of non-uniformly distributed inputs on data requirements

Abstract

In this paper I address the practical concern of predicting how much training data is sufficient for a statistical language learning system. First, I briefly review earlier results and show how these can be combined to bound the expected accuracy of a mode-based learner as a function of the volume of training data. I then develop a more accurate estimate of the expected accuracy function under the assumption that inputs are uniformly distributed. Since this estimate is expensive to compute, I also give a close but cheaply computable approximation to it. Finally, I report on a series of simulations exploring the effects of inputs that are not uniformly distributed. Although these results are based on simplistic assumptions, they are a tentative step toward a useful theory of data requirements for SLL systems.
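
To make the idea of an "expected accuracy function" concrete, here is a minimal Monte Carlo sketch in Python. It is not the paper's estimate or its simulation setup. It assumes, purely for illustration, that inputs fall uniformly into `num_bins` bins, that outputs are binary, that the more likely output in each bin occurs with probability `mode_prob`, and that the learner predicts the most frequent training output seen in a bin, guessing at random when a bin is empty or tied. The function name and all parameters are hypothetical.

```python
import random


def simulate_accuracy(num_bins, num_train, mode_prob=0.9,
                      num_test=2000, trials=20, seed=0):
    """Estimate the expected accuracy of a toy mode-based learner.

    Illustrative assumptions (not taken from the paper): uniform bins,
    binary outputs, and the same mode probability in every bin. By
    symmetry the true mode is labelled 1 in every bin.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Training: count how often each output was seen in each bin.
        ones = [0] * num_bins
        zeros = [0] * num_bins
        for _ in range(num_train):
            b = rng.randrange(num_bins)
            if rng.random() < mode_prob:
                ones[b] += 1
            else:
                zeros[b] += 1

        # Testing: predict the training mode of the bin the input falls in.
        correct = 0
        for _ in range(num_test):
            b = rng.randrange(num_bins)
            truth = 1 if rng.random() < mode_prob else 0
            if ones[b] > zeros[b]:
                pred = 1
            elif zeros[b] > ones[b]:
                pred = 0
            else:
                # Empty or tied bin: no evidence, so guess uniformly.
                pred = rng.randrange(2)
            correct += (pred == truth)
        total += correct / num_test
    return total / trials


if __name__ == "__main__":
    # Accuracy as a function of training volume for a fixed number of bins.
    for m in (0, 50, 200, 1000, 5000):
        print(m, round(simulate_accuracy(num_bins=50, num_train=m), 3))
```

Under these assumptions the estimate starts near chance with no training data and approaches `mode_prob` as every bin accumulates enough examples to identify its mode. Drawing `b` from a skewed distribution instead of `rng.randrange` would be one way to probe the non-uniform case the abstract mentions.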

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems