TL;DR
This paper introduces ToM-SWE, a dual-agent system that models user mental states to improve software engineering tasks, significantly enhancing success rates and user satisfaction in benchmarks and real-world developer studies.
Contribution
We propose a novel dual-agent architecture with a theory-of-mind partner that models user intent, improving task success and user experience in coding agents.
Findings
ToM-SWE achieves a 59.7% success rate on the stateful SWE-bench benchmark.
Participants found ToM-SWE useful 86% of the time in real developer studies.
ToM-SWE outperforms existing SWE agents like OpenHands in success rate.
Abstract
Recent advances in coding agents have made them capable of planning, editing, running, and testing complex code bases. Despite their growing ability in coding tasks, these systems still struggle to infer and track user intent, especially when instructions are underspecified or context-dependent. To bridge this gap, we introduce ToM-SWE, a dual-agent architecture that pairs a primary software-engineering (SWE) agent with a lightweight theory-of-mind (ToM) partner agent dedicated to modeling the user's mental state. The ToM agent infers user goals, constraints, and preferences from instructions and interaction history, maintains a persistent memory of the user, and provides user-related suggestions to the SWE agent. In two software engineering benchmarks (ambiguous SWE-bench and stateful SWE-bench), ToM-SWE improves task success rates and user satisfaction. Notably, on the…
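The division of labor described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class names (`UserModel`, `ToMAgent`, `SWEAgent`), the `handle` helper, and the keyword rules are all hypothetical, and the keyword matching merely stands in for what would be LLM-based inference in a real system. It only shows the flow: the ToM partner observes each instruction, updates a per-user memory, and passes suggestions to the SWE agent.

```python
from dataclasses import dataclass, field


@dataclass
class UserModel:
    """Hypothetical persistent record of what is known about one user."""
    preferences: list[str] = field(default_factory=list)
    history: list[str] = field(default_factory=list)


class ToMAgent:
    """Stand-in for the theory-of-mind partner (a real one would query an LLM)."""

    # Toy keyword rules, purely illustrative and not taken from the paper.
    RULES = {
        "test": "user expects unit tests with every change",
        "minimal": "user prefers small, focused diffs",
    }

    def __init__(self) -> None:
        # Memory outlives a single task, so preferences carry across sessions.
        self.memory: dict[str, UserModel] = {}

    def observe(self, user_id: str, instruction: str) -> None:
        """Record the instruction and update inferred preferences."""
        model = self.memory.setdefault(user_id, UserModel())
        model.history.append(instruction)
        lowered = instruction.lower()
        for keyword, preference in self.RULES.items():
            if keyword in lowered and preference not in model.preferences:
                model.preferences.append(preference)

    def suggest(self, user_id: str) -> list[str]:
        """Return user-related suggestions for the SWE agent."""
        return list(self.memory.get(user_id, UserModel()).preferences)


class SWEAgent:
    """Stand-in for the primary coding agent; it only formats a plan here."""

    def plan(self, instruction: str, suggestions: list[str]) -> str:
        hints = "; ".join(suggestions) if suggestions else "none"
        return f"Task: {instruction} | User hints: {hints}"


def handle(tom: ToMAgent, swe: SWEAgent, user_id: str, instruction: str) -> str:
    """One interaction turn: update the user model, then plan with its hints."""
    tom.observe(user_id, instruction)
    return swe.plan(instruction, tom.suggest(user_id))
```

In this sketch, a preference inferred during one task (for example, a request that mentions tests) is still surfaced on a later, unrelated task for the same user, which is the behavior the persistent memory is meant to provide.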
Peer Reviews
Decision·Submitted to ICLR 2026
The issue addressed by the contribution is relevant: human work and interaction with agents is sequential, and having agents learn and persist user preferences across individual tasks is an important part of it. To measure performance on sequential tasks, the paper introduces a new benchmark, SWE-bench sequential, that makes it possible to study tasks in sequence and to model interaction with an opinionated human developer. While some details are still unclear to me, this seems like an original and usef
## Readability
* For some time I did not understand what the Stateful SWE-bench score shown in Fig. 4 actually is. The paper states (Section 3) that it is the "user simulator satisfaction scores" and points to appendix A.6.2, but A.6.2 lists 5 different scores. But then what is shown in Fig. 4 (if I follow the main text) is actually the unit-test-based resolution rate, and the "satisfaction score" is an extra score that is shown in Table 1. This could be made clear.
* Fig 1: I r
1. Novel hierarchical agentic architecture that helps improve user satisfaction on SWE-style tasks.
2. Contribution of a new benchmark, Stateful SWE benchmark, which evaluates how well agents sustain meaningful interactions over time (evaluates long-term memory demands) with an interesting user-simulator based approach.
3. Human study with real-world developers to validate ToM agent’s importance is very novel and shows strong results of ToM and user satisfaction.
1. From my read of Section 3, I could not immediately tell what the difference of stateful SWE-bench from SWE-bench was in terms of problem_statement or other task inputs and outputs: I think this section would benefit from a diagram showing how SWE-bench issues were mapped and a specific example of what an instance of the mapping looks like (i.e. is the problem_statement modified and if so, in what way?).
1a. Also, how big is this new benchmark, is it 15 x 500 = 7500 total instances? If that’s
- The paper makes a good case for improving user modeling for interactive agents in software engineering scenarios.
- The paper establishes simple but meaningful categories for common user preferences in software engineering agents.
- The proposed augmentation with the ToM agent performs well on interactive SWE-bench variants, and also receives good feedback in an online trial with human software engineers.
- The paper claims that the advantage of the two-agent solution is "reduced context distraction and specialized optimization". This seems plausible but I believe the paper provides limited evidence beyond the simple RAG baseline (which already performs quite well with Claude 4). For example, I’d imagine one could prompt the main agent better to think about user intents, or the user profile analysis could be an offline step (aggregating past sessions + user profile → updated user profile), that
