Finding needles in a haystack: Sampling Structurally-diverse Training Sets from Synthetic Data for Compositional Generalization

Inbar Oren; Jonathan Herzig; Jonathan Berant

arXiv:2109.02575·cs.CL·September 7, 2021

TL;DR

This paper proposes a method for selecting structurally-diverse synthetic training examples. The selected examples substantially improve compositional generalization in semantic parsing while requiring far fewer synthetic examples.

Contribution

The paper introduces a sampling approach that selects structurally-diverse examples from a pool of synthetic data, improving both compositional generalization and data efficiency in semantic parsing.

Findings

01. Dramatic improvements in compositional generalization.

02. Moderate gains in the traditional i.i.d. setting.

03. A 200x improvement in data efficiency.

Abstract

Modern semantic parsers suffer from two principal limitations. First, training requires expensive collection of utterance-program pairs. Second, semantic parsers fail to generalize at test time to new compositions/structures that have not been observed during training. Recent research has shown that automatic generation of synthetic utterance-program pairs can alleviate the first problem, but its potential for the second has thus far been under-explored. In this work, we investigate automatic generation of synthetic utterance-program pairs for improving compositional generalization in semantic parsing. Given a small training set of annotated examples and an "infinite" pool of synthetic examples, we select a subset of synthetic examples that are structurally-diverse and use them to improve compositional generalization. We evaluate our approach on a new split of the schema2QA dataset, and…
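The selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes, purely for illustration, that a program's "structure" is its set of token bigrams, and it greedily picks the programs that add the most unseen bigrams. The function names `substructures` and `sample_diverse` and the toy programs in `pool` are hypothetical and are not taken from the paper or its repository.

```python
from typing import List, Set, Tuple

Structure = Tuple[str, ...]


def substructures(program: str, n: int = 2) -> Set[Structure]:
    """Assumed structure signature: the set of token n-grams in a program."""
    tokens = program.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def sample_diverse(pool: List[str], k: int) -> List[str]:
    """Greedily pick up to k programs, each adding the most unseen n-grams."""
    covered: Set[Structure] = set()
    remaining = list(pool)
    chosen: List[str] = []
    while remaining and len(chosen) < k:
        # Score each candidate by how many new substructures it contributes.
        best = max(remaining, key=lambda p: len(substructures(p) - covered))
        chosen.append(best)
        covered |= substructures(best)
        remaining.remove(best)
    return chosen


# Toy synthetic pool (hypothetical program syntax, for illustration only).
pool = [
    "filter ( rating > 4 ) restaurants",
    "filter ( rating > 3 ) restaurants",
    "count ( filter ( price < 10 ) hotels )",
    "sort ( restaurants , rating )",
]
selected = sample_diverse(pool, k=2)
print(selected)
```

Under this assumed notion of structure, two programs that differ only in a constant contribute almost the same substructures, so the greedy step avoids choosing both when something structurally new is still available. Uniform random sampling has no such preference.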

Code & Models

Repositories

inbaroren/scfg-sampling-for-comp-gen (Official)