Submodularity-Inspired Data Selection for Goal-Oriented Chatbot Training Based on Sentence Embeddings
Mladen Dimovski, Claudiu Musat, Vladimir Ilievski, Andreea Hossmann, Michael Baeriswyl

TL;DR
This paper introduces a submodularity-inspired data selection method that uses sentence embeddings to train goal-oriented chatbots with less labeled data, outperforming two known active learning techniques.
Contribution
It proposes a novel data ranking function, the ratio-penalty marginal gain, which is computed from distances in the sentence-embedding space. It reduces labeling costs and requires no model retraining during selection.
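To illustrate the general idea of ranking unlabeled sentences purely from their embeddings, here is a minimal sketch of greedy selection under a facility-location coverage objective, a standard submodular function. It is not the paper's ratio-penalty marginal gain, whose formula this summary does not give. The function names (`cosine_similarity`, `greedy_select`), the use of cosine similarity, and the toy 2-D vectors are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def greedy_select(embeddings, budget):
    """Greedily pick up to `budget` indices that maximize facility-location coverage.

    Illustrative stand-in only: the paper's own ranking function is not reproduced here.
    No model is trained or retrained; only pairwise embedding similarities are used.
    """
    n = len(embeddings)
    # Pairwise similarities, clipped at zero so coverage is non-negative.
    sim = [
        [max(0.0, cosine_similarity(embeddings[i], embeddings[j])) for j in range(n)]
        for i in range(n)
    ]
    best_cover = [0.0] * n  # how well each point is covered by the current selection
    selected = []
    for _ in range(min(budget, n)):
        best_idx, best_gain = None, -1.0
        for c in range(n):
            if c in selected:
                continue
            # Marginal gain: total improvement in coverage if candidate c is added.
            gain = sum(max(sim[c][j] - best_cover[j], 0.0) for j in range(n))
            if gain > best_gain:
                best_idx, best_gain = c, gain
        selected.append(best_idx)
        best_cover = [max(best_cover[j], sim[best_idx][j]) for j in range(n)]
    return selected


# Toy "sentence embeddings": two loose clusters in 2-D (hypothetical data).
toy_embeddings = [
    [1.0, 0.0], [0.99, 0.1], [0.98, -0.1],  # cluster A
    [0.0, 1.0], [0.1, 0.99],                # cluster B
]
to_label = greedy_select(toy_embeddings, budget=2)
```

With a budget of two, the greedy step first takes a point central to the larger cluster and then a point from the other cluster, because a second pick from the same cluster adds almost no new coverage. That diminishing-returns behavior is what makes submodular objectives attractive for choosing which sentences to send for labeling.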
Findings
Outperforms two known active learning methods in data selection.
Enables cost-efficient training with fewer labeled sentences.
Does not require model retraining during data selection.
Abstract
Spoken language understanding (SLU) systems, such as goal-oriented chatbots or personal assistants, rely on an initial natural language understanding (NLU) module to determine the intent and to extract the relevant information from the user queries they take as input. SLU systems usually help users solve problems in relatively narrow domains and require a large amount of in-domain training data. This leads to significant data availability issues that inhibit the development of successful systems. To alleviate this problem, we propose a data selection technique for the low-data regime that enables training with fewer labeled sentences and thus lower labeling costs. We propose a submodularity-inspired data ranking function, the ratio-penalty marginal gain, for selecting data points to label based only on the information extracted from the textual embedding space. We show that…
