Zero and Few-shot Semantic Parsing with Ambiguous Inputs
Elias Stengel-Eskin, Kyle Rawlins, Benjamin Van Durme

TL;DR
This paper introduces AmP, a framework and dataset for translating ambiguous natural language into formal representations. It finds that large models struggle to capture ambiguity unless explicitly instructed, which motivates modeling ambiguity explicitly.
Contribution
The paper presents AmP, a novel dataset and challenge for handling ambiguity in semantic parsing, and uses new metrics to evaluate how models handle ambiguous inputs.
Findings
Large pre-trained models perform poorly at capturing the distribution of possible meanings without deliberate instruction.
Models capture the distribution of meanings well when ambiguity is present in their inputs.
Including ambiguity explicitly improves model understanding and evaluation.
Abstract
Despite the frequent challenges posed by ambiguity when representing meaning via natural language, it is often ignored or deliberately removed in tasks mapping language to formally-designed representations, which generally assume a one-to-one mapping between linguistic and formal representations. We attempt to address this shortcoming by introducing AmP, a framework, dataset, and challenge for translating ambiguous natural language to formal representations like logic and code. We define templates and generate data for five well-documented linguistic ambiguities. Using AmP, we investigate how several few-shot text-to-code systems handle ambiguity, introducing three new metrics. We find that large pre-trained models perform poorly at capturing the distribution of possible meanings without deliberate instruction. However, models are able to capture the distribution well when ambiguity is…
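To make "capturing the distribution of possible meanings" concrete, here is a minimal sketch of how one might compare a parser's sampled outputs against a reference split over two readings of an ambiguous sentence. Everything in it is an assumption for illustration: the sentence, the logical-form strings, the 50/50 reference split, and the helper names `interpretation_distribution` and `total_variation` are hypothetical. It is not drawn from the AmP data and is not one of the paper's three metrics.

```python
from collections import Counter

# Illustrative PP-attachment ambiguity (hypothetical, not taken from AmP).
SENTENCE = "the boy saw the man with the telescope"
READINGS = {
    # Reading 1: the telescope is the instrument of the seeing event.
    "instrument": (
        "exists x y z e . boy(x) & man(y) & telescope(z) & saw(e) "
        "& agent(e, x) & patient(e, y) & instrument(e, z)"
    ),
    # Reading 2: the man is the one who has the telescope.
    "possession": (
        "exists x y z e . boy(x) & man(y) & telescope(z) & saw(e) "
        "& agent(e, x) & patient(e, y) & have(y, z)"
    ),
}


def normalize(lf: str) -> str:
    """Collapse whitespace so formatting differences do not affect matching."""
    return " ".join(lf.split())


def interpretation_distribution(predictions, readings):
    """Return the fraction of predictions matching each reading.

    Predictions that match no known reading are counted under "other".
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    by_form = {normalize(lf): name for name, lf in readings.items()}
    counts = Counter(by_form.get(normalize(p), "other") for p in predictions)
    total = len(predictions)
    return {name: counts[name] / total for name in [*readings, "other"]}


def total_variation(p, q):
    """Total variation distance between two distributions given as dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Four hypothetical samples from a parser: three instrument, one possession.
predictions = [READINGS["instrument"]] * 3 + [READINGS["possession"]]
predicted = interpretation_distribution(predictions, READINGS)

# Assumed reference: both readings equally likely.
reference = {"instrument": 0.5, "possession": 0.5}
distance = total_variation(predicted, reference)

print(predicted)  # {'instrument': 0.75, 'possession': 0.25, 'other': 0.0}
print(distance)   # 0.25
```

A distance of 0 would mean the sampled outputs split across the readings exactly as the reference does, while a parser that always commits to a single reading would score 0.5 against this reference. Exact string matching after whitespace normalization is a deliberately crude stand-in; a real evaluation would need to treat logically equivalent forms as the same reading.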
Peer Reviews
Decision: ICLR 2024 poster
1. The AmP dataset is a significant contribution, providing a resource specifically designed for investigating ambiguity in semantic parsing, which is a relatively unexplored area.
2. The paper takes a comprehensive approach by addressing the challenge from the perspective of both dataset creation and model evaluation.
3. The introduction of zero-shot and few-shot tasks offers a rigorous evaluation framework for future research on ambiguity in semantic parsing.
4. The development of new metri
1. While the paper provides a strong foundation, it could benefit from a more detailed exploration of how ambiguity affects real-world applications of semantic parsing.
2. The AmP dataset, while novel, might still be limited in scope and diversity, potentially affecting the robustness of the study’s conclusions.
3. It is unclear how the proposed methods deal with the dynamic nature of conversational context, which can significantly affect ambiguity resolution.
* The paper aims at an important problem (handling ambiguity in semantic parsing).
* The setup is clever and allows for some interesting analyses. I think looking at the token-level confidences to see how model uncertainty is reflected in ambiguity-resolution–dependent choice points, as done in Figure 5, is a useful idea.
* The comparison to human behavior (Section 3.2) is interesting and I imagine could seed future experiments.
I'm worried about how we assign meaning to the various results, and I'm not sure how this result would feed into future work that helps parsers handle ambiguity better. 1. Human experiments: humans were given both interpretations and asked to assign confidences to them. This seems a bit different from what the models were asked to do in the zero-shot experiments, which is implicitly pick out the ambiguity on their own. I understand it'd be hard to elicit this kind of behavior from humans — idea
1. The motivation and writing are very clear. The paper is generally easy to follow.
2. I personally like the human probability vs. model probability experiments. It is interesting that humans prefer one interpretation over the other, and also very interesting that model predictions somehow match those preferences.
1. The generation task is hard, especially generating logical forms. Why not formulate this as a multi-choice problem? Letting the model choose two from 10 possible combinations?
2. Is there any quantitative analysis? What kind of errors does the model usually make?
3. The evaluation metric can be improved. I have several questions about this. Why not use the same zero-shot and few-shot metric since the output format is the same? Why not use language interpretation instead of LF generations? La
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
