Enhancing Out-Of-Domain Utterance Detection with Data Augmentation Based on Word Embeddings
Yueqi Feng, Jiali Lin

TL;DR
This paper explores how data augmentation using word embeddings can improve out-of-domain utterance detection in intelligent assistants, especially with limited OOD data, by increasing sample diversity and dispersion.
Contribution
It introduces a sampling-based data augmentation method that enhances OOD detection accuracy by increasing the dispersion of OOD samples.
Findings
Dispersed OOD samples improve detection performance
Augmentation benefits are more significant with small sample sizes
Random sampling increases coverage of unknown OOD space
Abstract
For most intelligent assistant systems, a mechanism that automatically detects out-of-domain (OOD) utterances is essential for handling noisy input properly. One typical approach is to add a separate class of OOD utterance examples to the classifier alongside the in-domain text samples. However, since OOD utterances are usually absent from the training datasets, detection performance depends largely on the quality of the attached OOD text data, whose sample size is restricted by computing limits. In this paper, we study how augmented OOD data based on sampling impacts OOD utterance detection with a small sample size. We hypothesize that randomly chosen OOD utterance samples can increase the coverage of the unknown OOD utterance space and enhance detection accuracy if they are more dispersed. Experiments show that given the same dataset with the same OOD…
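The hypothesis in the abstract turns on how "dispersed" a set of OOD samples is. The sketch below is one possible way to make that notion concrete, not the authors' method: it assumes each utterance is already represented as a fixed-length embedding vector and uses mean pairwise Euclidean distance as a stand-in dispersion score. The function names `mean_pairwise_distance` and `sample_ood`, and the toy 2-D vectors, are hypothetical and chosen only for illustration.

```python
import math
import random
from itertools import combinations


def mean_pairwise_distance(vectors):
    """Average Euclidean distance over all unordered pairs of vectors.

    Used here as an assumed proxy for the "dispersion" of a sample set.
    """
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def sample_ood(pool, k, seed=0):
    """Draw k OOD candidates uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(pool, k)


# Toy 2-D "utterance embeddings" (hypothetical values, not from the paper).
clustered = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
spread = [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0), (3.0, 3.0)]

tight = mean_pairwise_distance(clustered)
wide = mean_pairwise_distance(spread)

# A random subset drawn from the combined pool of candidates.
subset = sample_ood(clustered + spread, k=4)
subset_dispersion = mean_pairwise_distance(subset)
```

Under this assumed metric, the `spread` set scores higher than the `clustered` set, which is the property the paper associates with better coverage of the unknown OOD space. Real utterance embeddings would be high-dimensional, and other dispersion measures may be equally reasonable.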
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · Natural Language Processing Techniques
