Nexus: Execution-Grounded Multi-Agent Test Oracle Synthesis

Dong Huang; Mingzhe Du; Jie M. Zhang; Zheng Lin; Meng Luo; Qianru Zhang; See-Kiong Ng

arXiv:2510.26423·cs.SE·October 31, 2025

3 Reviews

TL;DR

Nexus is a multi-agent framework that synthesizes accurate test oracles through collaborative critique, validation, and iterative refinement, significantly improving non-regression testing effectiveness across diverse benchmarks.

Contribution

This paper introduces Nexus, a novel multi-agent approach that enhances test oracle synthesis by integrating specialized agents and automated self-refinement, outperforming existing methods.

Findings

01. Nexus improves test oracle accuracy from 46.30% to 57.73%.

02. Nexus increases bug detection rate from 90.91% to 95.45%.

03. Nexus boosts automated program repair success from 35.23% to 69.32%.

Abstract

Test oracle generation in non-regression testing is a longstanding challenge in software engineering, where the goal is to produce oracles that can accurately determine whether a function under test (FUT) behaves as intended for a given input. In this paper, we introduce Nexus, a novel multi-agent framework to address this challenge. Nexus generates test oracles by leveraging a diverse set of specialized agents that synthesize test oracles through a structured process of deliberation, validation, and iterative self-refinement. During the deliberation phase, a panel of four specialist agents, each embodying a distinct testing philosophy, collaboratively critiques and refines an initial set of test oracles. Then, in the validation phase, Nexus generates a plausible candidate implementation of the FUT and executes the proposed oracles against it in a secure sandbox. For any oracle that…
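The validation phase described in the abstract can be pictured with a minimal sketch. It assumes an oracle is simply an input paired with an expected output, and it calls the candidate implementation in-process rather than in a secure sandbox. The names `Oracle`, `validate_oracles`, and `candidate_abs` are hypothetical and do not come from the paper. What Nexus does with an oracle that fails is cut off in the abstract above, so the sketch stops at reporting which oracles passed and which did not.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Oracle:
    """Hypothetical oracle shape: FUT arguments plus the expected return value."""
    args: Tuple[Any, ...]
    expected: Any


def validate_oracles(candidate: Callable[..., Any], oracles: List[Oracle]):
    """Execute each oracle against a candidate implementation of the FUT.

    Returns (passed, failed). Each failed entry pairs the oracle with the
    value the candidate actually returned, or with the repr of the exception.
    Assumption: a plain function call stands in for the paper's sandbox.
    """
    passed, failed = [], []
    for oracle in oracles:
        try:
            actual = candidate(*oracle.args)
        except Exception as exc:  # a crash counts as a failed oracle
            failed.append((oracle, repr(exc)))
            continue
        if actual == oracle.expected:
            passed.append(oracle)
        else:
            failed.append((oracle, actual))
    return passed, failed


def candidate_abs(x):
    """Stand-in for an LLM-generated candidate implementation of abs()."""
    return x if x >= 0 else -x


oracles = [
    Oracle(args=(3,), expected=3),
    Oracle(args=(-4,), expected=4),
    Oracle(args=(-2,), expected=-2),  # deliberately wrong expected output
]
passed, failed = validate_oracles(candidate_abs, oracles)
```

In this toy run the third oracle lands in `failed` because its expected value disagrees with the candidate. The sketch also shows the dependency on the candidate being correct: a wrong candidate would flag correct oracles in the same way.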

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. The paper is well written and easy to follow, with a good balance between qualitative examples and quantitative evidence. The problem is well motivated as well: Table 1 convincingly illustrates that while LLMs can already generate syntactically valid test inputs, they still struggle with semantic reasoning to produce correct outputs.

2. The experimental evaluation is comprehensive, involving multiple open and closed source LLMs. Results across several benchmarks are consistent and demonstrate

Weaknesses

1. Nexus is more complicated than prior frameworks. While the performance improvements are clear, the paper lacks a detailed analysis of computational cost (e.g., API budget, runtime).

2. The validation phase (lines 213–216) relies on an LLM-generated "candidate implementation" of the function under test. If this implementation is incorrect, it could lead to false positives or negatives: more analysis is needed to better understand how the oracle would behave in such a case.

3. The comparison f

Reviewer 02 · Rating 2 · Confidence 4

Strengths

I think the idea of using specialized agents to analyze method outputs is a good idea to improve upon CANDOR. The figures and tables are legible.

Weaknesses

I think this paper has several critical weaknesses, listed below.

**Lack of evidence** It is unclear if the method results in true gains as claimed, and if its performance gains really stem from the proposed setup.

- Table 6 only shows performance difference after a round of debugging information after direct generation/CANDOR/Nexus. Please also include the performance without debugging information as a true baseline so that it can be seen whether the oracles increase performance at all.

- The

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- The multi-agent deliberation and self-refinement pipeline is well designed and demonstrates good engineering effort.

- Clear and consistent improvements across multiple benchmarks.

- The validation loop leveraging execution feedback is a reasonable and practical idea.

Weaknesses

1. Fundamental Paradox: Solves Only the Easy Problems. Nexus performs well only on simple, well-specified functions where expected behavior is trivial. Ironically, these are exactly the cases where human developers can easily write tests themselves, while the harder, more ambiguous cases remain out of reach.

2. Severely Limited Applicability (Technical Scope). The framework assumes isolated, stateless, pure functions with deterministic I/O. It cannot handle realistic challenges such as shared s
