When More Data Hurts: A Troubling Quirk in Developing Broad-Coverage Natural Language Understanding Systems
Elias Stengel-Eskin, Emmanouil Antonios Platanios, Adam Pauls, Sam Thomson, Hao Fang, Benjamin Van Durme, Jason Eisner, Yu Su

TL;DR
This paper investigates how increasing training data in natural language understanding systems can paradoxically hinder learning new symbols, revealing a dilution effect and suggesting targeted data selection can improve performance.
Contribution
It is the first systematic study of incremental symbol learning in NLU. It identifies source signal dilution as a key challenge and proposes ways to mitigate it.
Findings
Performance on a new symbol often decreases as the overall training dataset grows, unless the new symbol's own training data is increased accordingly.
Source signal dilution weakens the lexical cues that models rely on to predict a new symbol, impairing new symbol learning.
Selective data dropping can reverse the negative trend.
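The dilution effect and the selective-dropping remedy can be pictured with a toy intent-recognition dataset. The sketch below is illustrative only and is not the authors' implementation: the helper names `cue_strength` and `drop_diluting`, the cue word "play", and all example utterances and labels are hypothetical. It assumes a lexical cue's strength can be approximated by the fraction of cue-bearing utterances that carry the new symbol.

```python
from typing import List, Tuple

Example = Tuple[str, str]  # (utterance, intent label); hypothetical toy format


def cue_strength(data: List[Example], cue: str, symbol: str) -> float:
    """Fraction of utterances containing `cue` that are labeled `symbol`."""
    labels_with_cue = [label for text, label in data if cue in text.lower().split()]
    if not labels_with_cue:
        return 0.0
    return sum(label == symbol for label in labels_with_cue) / len(labels_with_cue)


def drop_diluting(data: List[Example], cue: str, symbol: str) -> List[Example]:
    """Drop examples that contain `cue` but are not labeled `symbol`."""
    return [
        (text, label)
        for text, label in data
        if label == symbol or cue not in text.lower().split()
    ]


# Fixed amount of data for the new symbol (assumed label: "play_music").
new_symbol_data = [
    ("play some jazz", "play_music"),
    ("play my playlist", "play_music"),
]

# The rest of the dataset grows over time and starts to reuse the cue word.
small_rest = [("what is the weather", "get_weather")]
large_rest = small_rest + [
    ("play the voicemail", "check_voicemail"),
    ("who will play tonight", "sports_query"),
]

small = new_symbol_data + small_rest
large = new_symbol_data + large_rest
filtered = drop_diluting(large, "play", "play_music")

print(cue_strength(small, "play", "play_music"))     # cue strength in the small dataset
print(cue_strength(large, "play", "play_music"))     # cue strength after the dataset grows
print(cue_strength(filtered, "play", "play_music"))  # cue strength after selective dropping
```

In this toy setup the cue "play" predicts the new symbol perfectly in the small dataset, loses half of its strength once unrelated "play" utterances are added, and regains it after those utterances are dropped. Real systems would need a more careful notion of cue than exact token match.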
Abstract
In natural language understanding (NLU) production systems, users' evolving needs necessitate the addition of new features over time, indexed by new symbols added to the meaning representation space. This requires additional training data and results in ever-growing datasets. We present the first systematic investigation of this incremental symbol learning scenario. Our analysis reveals a troubling quirk in building broad-coverage NLU systems: as the training dataset grows, performance on the new symbol often decreases if we do not accordingly increase its training data. This suggests that it becomes more difficult to learn new symbols with a larger training dataset. We show that this trend holds for multiple mainstream models on two common NLU tasks: intent recognition and semantic parsing. Rejecting class imbalance as the sole culprit, we reveal that the trend is closely associated…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
