C2C-GenDA: Cluster-to-Cluster Generation for Data Augmentation of Slot Filling
Yutai Hou, Sanyuan Chen, Wanxiang Che, Cheng Chen, Ting Liu

TL;DR
C2C-GenDA is a novel data augmentation framework that jointly encodes and generates multiple semantically similar utterances to improve slot filling performance in spoken language understanding tasks.
Contribution
It introduces a cluster-to-cluster generation approach that enhances diversity and reduces duplication in data augmentation for slot filling.
Findings
Improves slot filling F-score by up to 13.6% on ATIS and Snips datasets.
Effectively enlarges training data with diverse, semantically consistent utterances.
Demonstrates significant gains with limited training data.
Abstract
Slot filling, a fundamental module of spoken language understanding, often suffers from insufficient quantity and diversity of training data. To remedy this, we propose a novel Cluster-to-Cluster generation framework for Data Augmentation (DA), named C2C-GenDA. It enlarges the training set by reconstructing existing utterances into alternative expressions while preserving their semantics. Unlike previous DA work, which reconstructs utterances one by one and independently, C2C-GenDA jointly encodes multiple existing utterances with the same semantics and simultaneously decodes multiple unseen expressions. Jointly generating multiple new utterances allows the model to consider the relations between generated instances, which encourages diversity. In addition, encoding multiple existing utterances gives C2C a wider view of existing expressions, helping to reduce generated utterances that duplicate existing data.…
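The data side of the cluster-to-cluster idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`delexicalize`, `build_clusters`, `split_cluster`, `filter_new`), the BIO tag format, the use of delexicalized placeholders, and the choice to define "same semantics" as "same intent and same set of slot names" are all assumptions made for illustration. The generation model itself is omitted; the sketch only shows how utterances might be grouped into clusters, split into an input cluster and a target cluster, and how generated candidates that duplicate existing data could be filtered out.

```python
from collections import defaultdict


def delexicalize(tokens, tags):
    """Replace each BIO-tagged slot span with one <slot_name> placeholder (assumed format)."""
    out = []
    for tok, tag in zip(tokens, tags):
        if tag == "O":
            out.append(tok)
        elif tag.startswith("B-"):
            out.append("<" + tag[2:] + ">")
        # I- tags continue a span that was already replaced by its placeholder
    return " ".join(out)


def build_clusters(examples):
    """Group delexicalized utterances by (intent, sorted slot names).

    Assumption: two utterances share the same semantics when they have
    the same intent and the same set of slot names.
    """
    clusters = defaultdict(list)
    for ex in examples:
        slots = tuple(sorted({t[2:] for t in ex["tags"] if t != "O"}))
        delex = delexicalize(ex["tokens"], ex["tags"])
        key = (ex["intent"], slots)
        if delex not in clusters[key]:  # keep each expression once per cluster
            clusters[key].append(delex)
    return dict(clusters)


def split_cluster(cluster, n_input):
    """Split one cluster into an input cluster and a target cluster (hypothetical scheme)."""
    return cluster[:n_input], cluster[n_input:]


def filter_new(generated, existing):
    """Keep generated utterances that duplicate neither existing data nor each other."""
    seen = set(existing)
    kept = []
    for utt in generated:
        if utt not in seen:
            seen.add(utt)
            kept.append(utt)
    return kept


# Toy ATIS-style examples, invented for this sketch.
examples = [
    {"intent": "flight", "tokens": ["show", "flights", "to", "new", "york"],
     "tags": ["O", "O", "O", "B-toloc", "I-toloc"]},
    {"intent": "flight", "tokens": ["i", "need", "a", "flight", "to", "boston"],
     "tags": ["O", "O", "O", "O", "O", "B-toloc"]},
    {"intent": "flight", "tokens": ["flights", "to", "denver"],
     "tags": ["O", "O", "B-toloc"]},
]

clusters = build_clusters(examples)
key = ("flight", ("toloc",))
source, target = split_cluster(clusters[key], n_input=2)

# Stand-in for model output: one duplicate of existing data, one repeated new candidate.
candidates = ["flights to <toloc>", "any flights to <toloc>", "any flights to <toloc>"]
augmented = filter_new(candidates, clusters[key])
```

In this sketch a one-by-one augmenter would see each utterance in isolation, whereas the pair `(source, target)` exposes several expressions of the same semantics at once, which is the property the abstract credits for higher diversity and fewer duplicates.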
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
