AI-for-Science Low-code Platform with Bayesian Adversarial Multi-Agent Framework

Zihang Zeng; Jiaquan Zhang; Pengze Li; Yuan Qi; Xi Chen

arXiv:2603.03233·cs.AI·March 4, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a Bayesian adversarial multi-agent low-code platform that enhances scientific code generation by improving reliability, reducing error propagation, and streamlining human-AI collaboration in complex scientific tasks.

Contribution

The paper presents a novel Bayesian adversarial multi-agent framework integrated into a low-code platform for AI for Science, addressing reliability and evaluation challenges in scientific code generation.

Findings

01

Demonstrates improved code robustness and reduced error propagation.

02

Outperforms existing models on Earth Science tasks.

03

Effectively translates non-expert prompts into domain-specific requirements.

Abstract

Large Language Models (LLMs) show potential for automating scientific code generation but face challenges in reliability, error propagation in multi-agent workflows, and evaluation in domains with ill-defined success metrics. We present a Bayesian adversarial multi-agent framework specifically designed for AI for Science (AI4S) tasks in the form of a Low-code Platform (LCP). Three LLM-based agents are coordinated under the Bayesian framework: a Task Manager that structures user inputs into actionable plans and adaptive test cases, a Code Generator that produces candidate solutions, and an Evaluator that provides comprehensive feedback. The framework employs an adversarial loop in which the Task Manager iteratively refines test cases to challenge the Code Generator, while prompt distributions are dynamically updated using Bayesian principles by integrating code quality metrics:…
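The loop described in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: the abstract is truncated before the quality metrics are listed, so the exponential-weight update, the stub evaluator, and every name here (`update_prompt_distribution`, `run_loop`, `prompt_skill`, `difficulty`, `temperature`) are assumptions made for illustration. It shows only the general shape: a distribution over candidate prompts is reweighted by an externally measured score, and test difficulty rises once the best candidate passes most tests. No LLM weights are trained.

```python
import math
import random


def update_prompt_distribution(prior, scores, temperature=0.5):
    """Bayes-style reweighting: posterior(p) is proportional to
    prior(p) * exp(score(p) / temperature).

    The exponential likelihood is an assumption of this sketch,
    not the update rule from the paper.
    """
    weighted = {p: prior[p] * math.exp(scores[p] / temperature) for p in prior}
    total = sum(weighted.values())
    return {p: w / total for p, w in weighted.items()}


def run_loop(prompt_skill, rounds=5, seed=0, tests_per_round=20):
    """Toy adversarial loop over hypothetical prompts.

    prompt_skill maps a prompt name to a made-up skill level in [0, 1]
    that stands in for the quality of code a real generator would produce.
    """
    rng = random.Random(seed)
    dist = {p: 1.0 / len(prompt_skill) for p in prompt_skill}
    difficulty = 0.2
    for _ in range(rounds):
        scores = {}
        for p, skill in prompt_skill.items():
            # Stub evaluator: the pass probability falls as tests get harder.
            pass_prob = max(0.0, min(1.0, skill - difficulty + 0.5))
            passed = sum(rng.random() < pass_prob for _ in range(tests_per_round))
            scores[p] = passed / tests_per_round
        dist = update_prompt_distribution(dist, scores)
        # Adversarial step: harden the tests once the best candidate copes well.
        if max(scores.values()) > 0.7:
            difficulty = min(1.0, difficulty + 0.1)
    return dist, difficulty


if __name__ == "__main__":
    final_dist, final_difficulty = run_loop({"strong": 0.9, "weak": 0.1})
    print(final_dist, final_difficulty)
```

In this sketch the probability mass shifts toward the prompt whose generated code keeps passing as the tests harden. A real system would replace the stub evaluator with actual test execution.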

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. Very neat idea that can be plugged into any existing agentic framework and LLM.
2. By design, the framework allows for curriculum learning (although this is not explicitly mentioned by the authors) -- I think this is a strength, since the difficulty level can be adjusted over iterations.

Weaknesses

1. Line 276: "TM adapts its weights for future evaluation" -- which weights are being updated here? Make it clear that the Bayesian updates are over the choice of prompts/codes, and not over tokens. Also, sub-agents are not trained, so there are no weight updates to the LLMs.
2. Are the test cases part of the plan that goes through loop 1 in Fig 2? Ideally, the domain expert should also provide some feedback on test cases during the planning phase. It is possible that the LLM generating test cas

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The underlying idea of gradually increasing the difficulty of the tests that generated output must pass, where the update is based on actual external evaluation rather than on potentially untrustworthy LLM assessment, is a very good one. It has also been implemented well, staying resource-efficient by evaluating only a subset of code generations.
- Strong empirical performance of the proposed framework across a diverse range of testbeds and use cases.

Weaknesses

- It is very unclear *where* the “bayesian update rule” is used. Algorithm 1 suggests that it is used to re-weight existing test cases so that increasingly challenging ones are presented as iterations proceed, that is, for the final prompt curation at each iteration. According to Appendix B, however, the rule is used for selecting a candidate code for evaluation (hence, by the evaluator agent).
- Unclear algorithmic steps: Line 12 says the $\lambda$'s are the test case weights, however they never get used in sub

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The paper is clearly written and the ideas are easy to follow.
2. The code generation framework is novel and achieves SOTA performance on both AI4S benchmarks and code generation benchmarks.
3. Extensive experiments have been conducted to show more in-depth details about the framework.

Weaknesses

1. For the SciCode benchmark, the baseline is somewhat unclear, and comparing it to only a single baseline seems insufficient.

Taxonomy

Topics: Scientific Computing and Data Management · Machine Learning in Materials Science · Artificial Intelligence in Healthcare and Education