BOAD: Discovering Hierarchical Software Engineering Agents via Bandit Optimization
Iris Xu, Guangtao Zeng, Zexue He, Charles Jin, Aldo Pareja, Dan Gutfreund, Chuang Gan, Zhang-Wei Hong

TL;DR
This paper introduces BOAD, a method that automatically discovers hierarchical multi-agent systems for software engineering tasks using bandit optimization, leading to improved generalization and performance on complex, out-of-distribution problems.
Contribution
The paper proposes a novel framework, BOAD, that formulates hierarchy discovery as a multi-armed bandit problem to automatically optimize sub-agent structures for SWE tasks.
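To make the bandit framing concrete, here is a minimal sketch of a multi-armed bandit loop in which each arm stands for one candidate sub-agent. It uses the generic UCB1 rule, which is an assumption chosen for illustration; this summary does not state which bandit algorithm or reward signal BOAD uses. The sub-agent names, the simulated success probabilities, and the helper functions `ucb1_select` and `run_bandit` are all hypothetical.

```python
import math
import random


def ucb1_select(counts, rewards, total_pulls, c=2.0):
    """Pick the arm with the highest UCB1 score; untried arms are pulled first."""
    for arm, n in counts.items():
        if n == 0:
            return arm
    return max(
        counts,
        key=lambda a: rewards[a] / counts[a]
        + math.sqrt(c * math.log(total_pulls) / counts[a]),
    )


def run_bandit(success_prob, rounds=2000, seed=0):
    """Simulate the bandit loop and return how often each arm was pulled.

    success_prob maps a (hypothetical) sub-agent name to a made-up chance
    that using it helps solve a task. A real system would replace the
    simulated draw with an actual task outcome.
    """
    rng = random.Random(seed)
    counts = {arm: 0 for arm in success_prob}
    rewards = {arm: 0.0 for arm in success_prob}
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, rewards, t)
        # Simulated Bernoulli reward standing in for "did the sub-agent help?"
        reward = 1.0 if rng.random() < success_prob[arm] else 0.0
        counts[arm] += 1
        rewards[arm] += reward
    return counts


if __name__ == "__main__":
    # Hypothetical candidates with invented success rates.
    candidates = {"issue_localizer": 0.2, "patch_editor": 0.5, "test_validator": 0.8}
    print(run_bandit(candidates))
```

The exploration bonus shrinks as an arm accumulates pulls, so the loop spends most of its budget on the candidate that appears most helpful while still occasionally re-testing the others.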
Findings
BOAD outperforms single-agent and manually designed multi-agent systems on SWE-bench-Verified.
BOAD's 36B system ranks second on SWE-bench-Live, surpassing larger models like GPT-4.
Automatically discovered hierarchies improve generalization on long-horizon SWE problems.
Abstract
Large language models (LLMs) have shown strong reasoning and coding capabilities, yet they struggle to generalize to real-world software engineering (SWE) problems that are long-horizon and out of distribution. Existing systems often rely on a single agent to handle the entire workflow (interpreting issues, navigating large codebases, and implementing fixes) within one reasoning chain. Such monolithic designs force the model to retain irrelevant context, leading to spurious correlations and poor generalization. Motivated by how human engineers decompose complex problems, we propose structuring SWE agents as orchestrators coordinating specialized sub-agents for sub-tasks such as localization, editing, and validation. The challenge lies in discovering effective hierarchies automatically: as the number of sub-agents grows, the search space becomes combinatorial, and it is difficult to…
Peer Reviews
Decision: ICLR 2026 Poster
- The paper is very well written and I thank the authors for explicitly stating the MAB formulation in the context of the code generation task.
- In terms of novelty, I was pretty surprised that a multi-armed bandit optimizer hasn't been tried before for software engineering tasks.
**Methodology:**
* **Training using 12 problems**: My understanding is that the SOTA accuracy number was reached after including 12 problems from SWE-Bench-Verified in the design set for BOAD. This raises two concerns:
  * Why 12 problems specifically? Does increasing or decreasing the set of design problems drastically affect final performance?
  * None of the other baselines are automatically tuning the multi-agent system. I understand that an evolutionary agent here might be less efficient than
1. Interesting formulation of the multi-agent system discovery problem as a multi-armed bandit. This creates balance between exploration and exploitation while selecting and evolving subagents.
2. Great performance on SWE-Bench Live demonstrates the effectiveness of the method.
3. Comprehensive ablation showing the effectiveness of having the subagents, customizing the orchestrator, and using hindsight helpfulness for credit assignment.
1. Lack of details about the actual optimization process. How many agents are there in the final set of subagents? How many of the top agents are from the expanded set or the initial set of subagents? What are the final top agents selected? Qualitatively, why are they better than other subagents? Readers need these details to get a better idea of the final discovered system.
2. Single-run evaluation is insufficient. The non-determinism of LLM agents results in a lot of randomness in every agent r
The method is intuitive and interesting, although it largely relies on the abilities of LLMs for judging and proposing sub-agents. We do need various ways to balance exploration and exploitation. The searched agent demonstrates strong performance on popular benchmarks as well. The paper is generally well-written and easy to understand.
* Missing naive baseline, such as evolution search. One can treat all prompts of sub-agents and/or orchestrators as parameters and use LLMs + evolution search to optimize them. There are various prior works that balance exploration and exploitation for naive LLM tree search as well. The authors discussed this baseline in the method section, claiming it is prohibitively expensive with no experimental results.
* I'm not sure if the comparison with baselines is fair, missing experimental details.
Taxonomy
Topics: Advanced Bandit Algorithms Research · Mobile Crowdsensing and Crowdsourcing · Software Engineering Techniques and Practices
