CST5: Data Augmentation for Code-Switched Semantic Parsing

Anmol Agarwal; Jigar Gupta; Rahul Goel; Shyam Upadhyay; Pankaj Joshi; Rengarajan Aravamudhan

arXiv:2211.07514 · cs.CL · November 15, 2022 · 5 citations

TL;DR

This paper introduces CST5, a data augmentation method that uses a fine-tuned T5 model to generate high-quality code-switched data, significantly reducing the need for labeled data in semantic parsing tasks.

Contribution

We propose CST5, a novel data augmentation technique for code-switched semantic parsing, and release Hinglish-TOP, the largest human-annotated code-switched semantic parsing dataset to date, along with a large set of generated code-switched utterances.

Findings

01. CST5 achieves the same semantic parsing performance with up to 20x less labeled data.

02. Human evaluation rates the generated code-switched data as high quality.

03. Models trained with CST5-augmented data outperform baselines trained without augmentation on code-switched semantic parsing.

Abstract

Extending semantic parsers to code-switched input has been a challenging problem, primarily due to a lack of supervised training data. In this work, we introduce CST5, a new data augmentation technique that finetunes a T5 model using a small seed set ($\approx$100 utterances) to generate code-switched utterances from English utterances. We show that CST5 generates high-quality code-switched data, both intrinsically (per human evaluation) and extrinsically, by comparing baseline models trained without data augmentation to models trained with augmented data. Empirically, we observe that with CST5 one can achieve the same semantic parsing performance using up to 20x less labeled data. To aid further research in this area, we are also releasing (a) Hinglish-TOP, the largest human-annotated code-switched semantic parsing dataset to date, containing 10k human annotated…
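The augmentation flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the task prefix, the function names, and the toy generator (which stands in for a fine-tuned T5 model) are all assumptions, and the tiny lexicon is purely illustrative.

```python
from typing import Callable, List, Tuple

# Hypothetical task prefix; the abstract does not specify the input format.
PREFIX = "english to code-switched: "


def make_seed_pairs(seed: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Format (English, code-switched) seed pairs as text-to-text
    (input, target) pairs, the shape a T5-style model is fine-tuned on."""
    return [(PREFIX + english, code_switched) for english, code_switched in seed]


def generate_code_switched(
    english_utterances: List[str],
    generate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Run a generator over English utterances and return
    (English, generated code-switched) pairs."""
    return [(en, generate(PREFIX + en)) for en in english_utterances]


def toy_generator(prompt: str) -> str:
    """Placeholder for the fine-tuned model (assumption): swaps a few
    English words for Hindi ones using an illustrative lexicon."""
    text = prompt[len(PREFIX):] if prompt.startswith(PREFIX) else prompt
    lexicon = {"tomorrow": "kal", "weather": "mausam"}
    return " ".join(lexicon.get(token, token) for token in text.split())


if __name__ == "__main__":
    seed = [("weather tomorrow", "kal ka mausam")]
    print(make_seed_pairs(seed))
    print(generate_code_switched(["what is the weather tomorrow"], toy_generator))
```

In practice, `toy_generator` would be replaced by a call to a sequence-to-sequence model fine-tuned on the seed pairs; how the generated utterances are paired with semantic parses is not covered by this sketch.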

Code & Models

Repositories

google-research-datasets/hinglish-top-dataset
Official

Datasets

WillHeld/hinglish_top
dataset · 36 downloads

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Byte Pair Encoding · Dropout · Attention Dropout · Dense Connections · Gated Linear Unit · Adafactor