Unnatural Language Processing: Bridging the Gap Between Synthetic and Natural Language Data
Alana Marzoev, Samuel Madden, M. Frans Kaashoek, Michael Cafarella, Jacob Andreas

TL;DR
This paper presents a simulation-to-real transfer method in NLP that enables models to interpret natural language without requiring natural training data, by using synthetic data and learned sentence embeddings.
Contribution
The authors introduce a novel transfer technique that leverages synthetic data and embedding-based projections to bridge the gap between synthetic and natural language understanding.
Findings
With only synthetic training data, the approach matches or outperforms state-of-the-art models trained on natural data.
For language understanding problems with a delimited set of target behaviors, synthetic data alone can be sufficient.
The method demonstrates broad applicability across multiple NLP domains.
Abstract
Large, human-annotated datasets are central to the development of natural language processing models. Collecting these datasets can be the most challenging part of the development process. We address this problem by introducing a general purpose technique for "simulation-to-real" transfer in language understanding problems with a delimited set of target behaviors, making it possible to develop models that can interpret natural utterances without natural training data. We begin with a synthetic data generation procedure, and train a model that can accurately interpret utterances produced by the data generator. To generalize to natural utterances, we automatically find projections of natural language utterances onto the support of the synthetic language, using learned sentence embeddings to define a distance metric. With only synthetic training data, our approach matches or outperforms…
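The projection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the paper uses learned sentence embeddings, whereas the sketch below substitutes a toy bag-of-words vector so that it runs without any external model. The names `embed`, `cosine_distance`, `project`, `interpret`, and the `SYNTHETIC` table are all hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical synthetic utterances paired with their target behaviors.
# In the paper these would come from a synthetic data generator.
SYNTHETIC = {
    "show flights to denver": "list_flights(dst=DEN)",
    "show flights to boston": "list_flights(dst=BOS)",
    "show the cheapest fare to denver": "min_fare(dst=DEN)",
}

def embed(sentence):
    """Toy stand-in (assumption) for a learned sentence embedding:
    a sparse bag-of-words count vector."""
    return Counter(sentence.lower().split())

def cosine_distance(a, b):
    """Cosine distance between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def project(natural, synthetic_utterances):
    """Map a natural utterance to the nearest synthetic utterance
    under the embedding-based distance."""
    query = embed(natural)
    return min(synthetic_utterances,
               key=lambda s: cosine_distance(query, embed(s)))

def interpret(natural):
    """Interpret a natural utterance by projecting it onto the synthetic
    language and looking up the behavior of the projected utterance."""
    return SYNTHETIC[project(natural, list(SYNTHETIC))]
```

Under these assumptions, calling `interpret("could you show me some flights going to boston")` projects the request onto the synthetic utterance `show flights to boston` and returns its associated behavior. A lookup table stands in here for the model trained on synthetic data, and a real system would replace `embed` with a learned sentence encoder.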
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
