CST5: Data Augmentation for Code-Switched Semantic Parsing
Anmol Agarwal, Jigar Gupta, Rahul Goel, Shyam Upadhyay, Pankaj Joshi, Rengarajan Aravamudhan

TL;DR
This paper introduces CST5, a data augmentation method that uses a fine-tuned T5 model to generate high-quality code-switched data, significantly reducing the need for labeled data in semantic parsing tasks.
Contribution
We propose CST5, a novel data augmentation technique for code-switched semantic parsing, and release Hinglish-TOP, the largest human-annotated code-switched semantic parsing dataset to date, along with a large set of generated code-switched utterances.
Findings
With CST5, a parser can reach the same semantic parsing performance using up to 20x less labeled data.
Generated data is of high quality based on human evaluation.
Training with CST5-augmented data improves semantic parsing on code-switched input over baselines trained without augmentation.
Abstract
Extending semantic parsers to code-switched input has been a challenging problem, primarily due to a lack of supervised training data. In this work, we introduce CST5, a new data augmentation technique that fine-tunes a T5 model using a small seed set (100 utterances) to generate code-switched utterances from English utterances. We show that CST5 generates high-quality code-switched data, both intrinsically (per human evaluation) and extrinsically, by comparing baseline models trained without data augmentation to models trained with augmented data. Empirically, we observe that using CST5, one can achieve the same semantic parsing performance by using up to 20x less labeled data. To aid further research in this area, we are also releasing (a) Hinglish-TOP, the largest human-annotated code-switched semantic parsing dataset to date, containing 10k human annotated…
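The data flow described in the abstract (a small seed set of English/code-switched pairs, a fine-tuned text-to-text generator, then generation over further English utterances) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the task prefix, the example utterances, the filtering rule, and the `toy_generate` stand-in for a fine-tuned T5 model are all assumptions made here for demonstration.

```python
from typing import Callable, List, Tuple

# Hypothetical task prefix for a text-to-text model; not taken from the paper.
PREFIX = "translate English to Hinglish: "


def make_seq2seq_examples(
    seed_pairs: List[Tuple[str, str]], prefix: str = PREFIX
) -> List[Tuple[str, str]]:
    """Format (English, code-switched) seed pairs as (input, target) text pairs."""
    return [(prefix + en.strip(), cs.strip()) for en, cs in seed_pairs]


def augment(
    english_utterances: List[str],
    generate: Callable[[str], str],
    prefix: str = PREFIX,
) -> List[Tuple[str, str]]:
    """Run a generator over English utterances and collect code-switched outputs.

    Assumption: empty outputs and verbatim copies of the input are dropped.
    The paper may filter differently, or not at all.
    """
    augmented = []
    for en in english_utterances:
        source = en.strip()
        cs = generate(prefix + source).strip()
        if cs and cs.lower() != source.lower():
            augmented.append((source, cs))
    return augmented


# Illustrative seed pairs written for this sketch (not drawn from Hinglish-TOP).
seed_pairs = [
    ("set an alarm for 7 am", "7 am ka alarm set karo"),
    ("play some music", "thoda music play karo"),
]
train_examples = make_seq2seq_examples(seed_pairs)

# Stand-in for a fine-tuned T5 model: a lookup over the seed examples.
# A real pipeline would call the fine-tuned model's generation routine here.
_lookup = dict(train_examples)


def toy_generate(text: str) -> str:
    return _lookup.get(text, "")


augmented = augment(["play some music", "what is the weather"], toy_generate)
print(augmented)
```

Keeping the generator behind a plain callable means the same `augment` loop works unchanged once the lookup stand-in is replaced by an actual fine-tuned model.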
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Byte Pair Encoding · Dropout · Attention Dropout · Dense Connections · Gated Linear Unit · Adafactor
