When More Data Hurts: A Troubling Quirk in Developing Broad-Coverage Natural Language Understanding Systems

Elias Stengel-Eskin; Emmanouil Antonios Platanios; Adam Pauls; Sam Thomson; Hao Fang; Benjamin Van Durme; Jason Eisner; Yu Su

arXiv:2205.12228·cs.CL·November 9, 2022

TL;DR

This paper investigates how growing the training dataset of a natural language understanding system can paradoxically make new symbols harder to learn. It links the effect to source signal dilution and suggests that selectively dropping training examples can improve performance.

Contribution

The paper presents the first systematic study of incremental symbol learning in NLU, identifies source signal dilution as a key challenge, and proposes ways to mitigate it.

Findings

01

Performance on a new symbol often decreases as the training dataset grows, unless the new symbol's own training data is increased accordingly.

02

Source signal dilution causes reliance on lexical cues, impairing new symbol learning.

03

Selective data dropping can reverse the negative trend.
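
The dilution and data-dropping findings above can be illustrated with a toy sketch. This is not the authors' procedure: the dataset, the symbol name `FindManager`, the cue word `boss`, and both helper functions are hypothetical. The sketch assumes that dilution can be approximated by the share of cue-bearing examples that carry the new symbol, and that "selective dropping" means removing cue-bearing examples labeled with other symbols.

```python
def cue_purity(examples, cue, new_symbol):
    """Fraction of examples containing `cue` that are labeled `new_symbol`."""
    labels = [label for text, label in examples if cue in text.lower().split()]
    if not labels:
        return 0.0
    return sum(label == new_symbol for label in labels) / len(labels)


def drop_diluting(examples, cue, new_symbol):
    """Keep new-symbol examples and any example that does not contain the cue."""
    return [
        (text, label)
        for text, label in examples
        if label == new_symbol or cue not in text.lower().split()
    ]


# A fixed handful of examples for the new symbol (hypothetical data).
new_examples = [
    ("who is my boss", "FindManager"),
    ("find the boss of alice", "FindManager"),
]

# The rest of the dataset grows while the new-symbol examples stay fixed.
small = new_examples + [
    ("create a meeting", "CreateEvent"),
    ("email my boss the report", "SendEmail"),
]
large = small + [
    ("schedule lunch with my boss", "CreateEvent"),
    ("tell my boss i am late", "SendMessage"),
    ("what is the weather", "Weather"),
]

purity_small = cue_purity(small, "boss", "FindManager")
purity_large = cue_purity(large, "boss", "FindManager")
filtered = drop_diluting(large, "boss", "FindManager")
purity_filtered = cue_purity(filtered, "boss", "FindManager")

print(purity_small, purity_large, purity_filtered)
```

In this toy setting the cue becomes a weaker signal for the new symbol as the dataset grows, and filtering restores it at the cost of discarding some training data for other symbols. Whether that trade-off is worthwhile in a real system is an empirical question that the sketch does not answer.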

Abstract

In natural language understanding (NLU) production systems, users' evolving needs necessitate the addition of new features over time, indexed by new symbols added to the meaning representation space. This requires additional training data and results in ever-growing datasets. We present the first systematic investigation of this incremental symbol learning scenario. Our analysis reveals a troubling quirk in building broad-coverage NLU systems: as the training dataset grows, performance on the new symbol often decreases if we do not accordingly increase its training data. This suggests that it becomes more difficult to learn new symbols with a larger training dataset. We show that this trend holds for multiple mainstream models on two common NLU tasks: intent recognition and semantic parsing. Rejecting class imbalance as the sole culprit, we reveal that the trend is closely associated…

Code & Models

Repositories

esteng/calibration_miso
pytorch

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems