Ambig-SWE: Interactive Agents to Overcome Underspecificity in Software Engineering

Sanidhya Vijayvargiya; Xuhui Zhou; Akhila Yerukola; Maarten Sap; Graham Neubig

arXiv:2502.13069·cs.AI·February 24, 2026

TL;DR

This paper evaluates how large language model agents handle underspecified instructions in software engineering tasks, emphasizing the importance of interaction and clarification to improve performance and reduce risks.

Contribution

Introduces Ambig-SWE, a benchmark for evaluating agent behavior under ambiguity, and analyzes the ability of models to detect underspecificity, ask clarifying questions, and improve outcomes through interaction.

Findings

01

Models struggle to distinguish underspecified instructions from well-specified ones.

02

Interactive clarification significantly improves performance, by up to 74% over the non-interactive setting.

03

Current models have critical gaps in handling missing information.

Abstract

AI agents are increasingly being deployed to automate tasks, often based on underspecified user instructions. Making unwarranted assumptions to compensate for the missing information and failing to ask clarifying questions can lead to suboptimal outcomes, safety risks due to tool misuse, and wasted computational resources. In this work, we study the ability of LLM agents to handle underspecified instructions in interactive code generation settings by evaluating proprietary and open-weight models on their performance across three key steps: (a) detecting underspecificity, (b) asking targeted clarification questions, and (c) leveraging the interaction to improve performance in underspecified scenarios. We introduce Ambig-SWE, an underspecified variant of SWE-Bench Verified, specifically designed to evaluate agent behavior under ambiguity and interaction. Our findings reveal that models…
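The three evaluation steps named in the abstract can be illustrated with a small scoring sketch. This is not the authors' evaluation code: the `Episode` record, its field names, and the setting labels `"hidden"`, `"interaction"`, and `"full"` are assumptions made here for illustration, and the relative-gain formula is only one plausible way to express an improvement over a non-interactive baseline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    """One agent run on one task (hypothetical record layout)."""
    setting: str          # assumed labels: "full", "hidden", or "interaction"
    underspecified: bool  # was the issue text shown to the agent underspecified?
    asked_question: bool  # did the agent ask the user a clarifying question?
    resolved: bool        # did the final patch pass the task's tests?


def detection_accuracy(episodes: List[Episode]) -> float:
    """Step (a): fraction of runs where asking matched the need to ask.

    Only meaningful for runs in which the agent was allowed to ask.
    """
    if not episodes:
        raise ValueError("no episodes to score")
    correct = sum(e.asked_question == e.underspecified for e in episodes)
    return correct / len(episodes)


def resolve_rate(episodes: List[Episode], setting: str) -> float:
    """Fraction of resolved tasks within one setting."""
    subset = [e for e in episodes if e.setting == setting]
    if not subset:
        raise ValueError(f"no episodes for setting {setting!r}")
    return sum(e.resolved for e in subset) / len(subset)


def relative_gain(episodes: List[Episode],
                  baseline: str = "hidden",
                  treatment: str = "interaction") -> float:
    """Step (c): relative improvement of the treatment over the baseline."""
    base = resolve_rate(episodes, baseline)
    if base == 0:
        raise ValueError("baseline resolve rate is zero; relative gain undefined")
    return (resolve_rate(episodes, treatment) - base) / base
```

Under these assumptions, a baseline resolve rate of 0.25 and an interactive resolve rate of 0.50 would give a relative gain of 1.0, that is, a 100% improvement. Step (b), the quality of the questions themselves, is not captured by this sketch.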

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. In my view, RQ1 and RQ2 are really interesting and well-scoped. I appreciate the experimental design, which attempts to build comparable evaluations across "Hidden", "Interaction" and "Full" settings. The primary results in Figure 3 are super interesting, and I believe potentially very influential in the field of software-engineering evaluations. However, I would really like to see a more comprehensive set of models here.
2. The construction of the dataset is well-described and I appreciate t

Weaknesses

1. Outdated models: Both proprietary (Claude-3.5) and open-source models (DeepSeek-v2 and Llama-70b) are unfortunately rather behind the state of the art, given the rate of progress in recent months. Just for Claude models alone, we've seen Sonnet-3.7, Sonnet-4, and Sonnet-4.5 in the meantime. For this paper to be relevant for a conference presentation, I fear that the paper would really require updated results from more up-to-date frontier models. This will also make the claims around open-weig

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The paper addresses a practically important problem. Real-world task descriptions are often incomplete, and understanding how agents handle this is valuable.
2. The experimental design is generally rigorous.

Weaknesses

1. The most significant weakness is the lack of human validation for the synthetic underspecified issues. The authors use GPT-4o to generate summaries but provide no evidence that these summaries would actually prevent human experts from solving the tasks. Are the findings representative of real underspecification?
2. The classification of missing information into only "informational" and "navigational" details is overly simplistic. The authors mention "multiple, interdependent gaps" in real tas

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The paper explores a critical issue with current LLMs: they typically cannot recognize underspecificity in user queries.
2. The paper clearly defines underspecificity as “missing information that prevents an expert from producing a correct fix,” grounding it in the SWE-Bench Verified rubric rather than using vague notions of ambiguity.
3. The study divides performance into three measurable capabilities: 1) detecting underspecificity, 2) asking targeted questions, and 3) leveraging re

Weaknesses

1. The paper admits that naturally underspecified GitHub issues often still contain concrete technical cues (error messages, file references, conversational fragments), whereas the generated summaries mainly remove details, which may exaggerate the severity of underspecificity and may bias the task toward “missing vital context” rather than “ambiguous intent.”
2. The paper mentions using the OpenHands agent environment but gives minimal explanation of how the agent framework is structured or how

Code & Models

Repositories

sani903/interactivesweagents · Official

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Business Process Modeling and Analysis · Service-Oriented Architecture and Web Services