Conserving Fuel in Statistical Language Learning: Predicting Data Requirements
Mark Lauer (Microsoft Institute, Sydney)

TL;DR
This paper develops methods to predict the amount of training data needed for statistical language learning systems, combining theoretical bounds and simulations to better understand data requirements.
Contribution
It introduces a new accuracy estimation method for mode-based learners and explores data distribution effects on training data needs.
Findings
Derived bounds on expected accuracy based on training data volume
Proposed a computationally efficient approximation of the accuracy estimate
Ran simulations exploring how non-uniform input distributions affect training data requirements
Abstract
In this paper I address the practical concern of predicting how much training data is sufficient for a statistical language learning (SLL) system. First, I briefly review earlier results and show how these can be combined to bound the expected accuracy of a mode-based learner as a function of the volume of training data. I then develop a more accurate estimate of the expected accuracy function under the assumption that inputs are uniformly distributed. Since this estimate is expensive to compute, I also give a close but cheaply computable approximation to it. Finally, I report on a series of simulations exploring the effects of inputs that are not uniformly distributed. Although these results are based on simplistic assumptions, they are a tentative step toward a useful theory of data requirements for SLL systems.
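To make the idea of accuracy as a function of training data volume concrete, here is a minimal simulation sketch. It is not the paper's estimate or its experimental setup. It assumes one simple reading of a mode-based learner: inputs fall into discrete bins, and for each bin the learner predicts the value it saw most often in training. The function name `simulate_mode_learner`, the binary output values, the fixed per-bin probability `mode_prob`, and the random guess for unseen bins are all illustrative assumptions.

```python
import random
from collections import Counter


def simulate_mode_learner(num_bins, num_train, mode_prob=0.9,
                          num_test=2000, seed=0):
    """Estimate the test accuracy of a toy mode-based learner.

    Assumptions (hypothetical, for illustration only):
      - inputs are drawn uniformly from `num_bins` bins;
      - every bin emits value 1 with probability `mode_prob`, else 0;
      - for an unseen bin the learner guesses 0 or 1 at random.
    """
    rng = random.Random(seed)

    def draw():
        # One (bin, value) pair from the assumed uniform input distribution.
        b = rng.randrange(num_bins)
        v = 1 if rng.random() < mode_prob else 0
        return b, v

    # Training: count how often each value occurs in each bin.
    counts = [Counter() for _ in range(num_bins)]
    for _ in range(num_train):
        b, v = draw()
        counts[b][v] += 1

    # Testing: predict the most frequent training value (the mode) per bin.
    correct = 0
    for _ in range(num_test):
        b, v = draw()
        if counts[b]:
            # Ties are broken by whichever value was seen first in training.
            guess = counts[b].most_common(1)[0][0]
        else:
            guess = rng.randrange(2)
        correct += (guess == v)
    return correct / num_test


if __name__ == "__main__":
    # Trace a rough learning curve over increasing training volumes.
    for m in (0, 50, 200, 1000, 5000):
        print(m, simulate_mode_learner(num_bins=50, num_train=m))
```

Calling the function for increasing values of `num_train` traces a rough learning curve. Under these assumptions accuracy starts near chance and approaches `mode_prob` as every bin accumulates enough examples for its mode to be identified reliably.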
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
