TL;DR
This paper introduces a new data collection method for semantic parsing that combines crowdsourcing with a paraphrase model, significantly improving accuracy by addressing distribution mismatches in data collection.
Contribution
It identifies two sources of distribution mismatch in existing data collection methods, one in the logical form distribution and one in the language distribution, and proposes a new approach that uses unlabeled data and a paraphrase model to improve semantic parsing accuracy.
Findings
Achieved 70.6% accuracy on the true data distribution.
Outperformed traditional paraphrasing-based data collection, which reached 51.3% accuracy.
Mitigated the distribution mismatch between the true data and the collected data.
Abstract
A major hurdle on the road to conversational interfaces is the difficulty in collecting data that maps language utterances to logical forms. One prominent approach for data collection has been to automatically generate pseudo-language paired with logical forms, and paraphrase the pseudo-language to natural language through crowdsourcing (Wang et al., 2015). However, this data collection procedure often leads to low performance on real data, due to a mismatch between the true distribution of examples and the distribution induced by the data collection procedure. In this paper, we thoroughly analyze two sources of mismatch in this process: the mismatch in logical form distribution and the mismatch in language distribution between the true and induced distributions. We quantify the effects of these mismatches, and propose a new data collection approach that mitigates them. Assuming access…
