SciNav: A General Agent Framework for Scientific Coding Tasks

Tianshu Zhang; Huan Sun

arXiv:2603.20256·cs.CL·March 24, 2026

Open Access 3 Reviews

TL;DR

SciNav is a novel framework for scientific coding tasks that uses pairwise relative judgments within a tree search to efficiently explore solutions, outperforming previous methods across benchmarks.

Contribution

Introduces SciNav, an end-to-end science agent framework leveraging pairwise relative judgments and tree search for improved scientific coding performance.

Findings

01

SciNav outperforms direct prompting and prior agents.

02

Effective across different task types and difficulty levels.

03

Relative judgment-guided search enhances solution quality.

Abstract

Autonomous science agents built on large language models (LLMs) are increasingly used to generate hypotheses, design experiments, and produce reports. However, prior work mainly targets open-ended scientific problems with subjective outputs that are difficult to evaluate. Scientific coding benchmarks, by contrast, provide executable outputs for objective assessment. Existing approaches remain engineering-driven pipelines, revealing the need for structured, end-to-end science agent frameworks for scientific coding tasks. We address this gap by focusing on scientific coding tasks, where evaluation can be made rigorously, and introducing an agent framework SciNav (Scientific Navigator) that enables more effective solution exploration. Our framework is designed to operate under constrained search budgets, moving beyond reliance on pre-defined success metrics and prolonged search cycles.…
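
The summary above and the reviews below refer to a relative judgment-guided Top-K search, but this listing does not spell out the algorithm. The following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a round-robin tournament in which a hypothetical comparator `prefer(a, b)` (standing in for an LLM judge) is queried in both orders, and candidates are ranked by win count. The names `top_k_by_pairwise`, `prefer`, and `toy_prefer`, as well as the tie handling, are assumptions made for illustration.

```python
from itertools import combinations
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def top_k_by_pairwise(
    candidates: Sequence[T],
    prefer: Callable[[T, T], bool],
    k: int,
) -> List[T]:
    """Keep the k candidates with the most pairwise wins.

    `prefer(a, b)` is a hypothetical comparator that returns True when
    `a` is judged better than `b`. Each pair is queried in both orders;
    a full win is awarded only when the two answers agree, otherwise the
    point is split (an assumed way to dampen order sensitivity).
    """
    n = len(candidates)
    wins = [0.0] * n
    for i, j in combinations(range(n), 2):
        a, b = candidates[i], candidates[j]
        forward = prefer(a, b)   # True: a judged better than b
        backward = prefer(b, a)  # True: b judged better than a
        if forward and not backward:
            wins[i] += 1.0
        elif backward and not forward:
            wins[j] += 1.0
        else:
            # Inconsistent or tied judgments: split the point.
            wins[i] += 0.5
            wins[j] += 0.5
    # Python's sort is stable, so ties keep their original order.
    ranked = sorted(range(n), key=lambda idx: wins[idx], reverse=True)
    return [candidates[idx] for idx in ranked[:k]]


def toy_prefer(a: tuple, b: tuple) -> bool:
    """Toy stand-in for an LLM judge: compares a hidden quality score."""
    return a[1] > b[1]


# Hypothetical candidate plans, each paired with a made-up quality score.
candidates = [("plan_a", 0.2), ("plan_b", 0.9), ("plan_c", 0.5), ("plan_d", 0.7)]
best = top_k_by_pairwise(candidates, toy_prefer, k=2)
print(best)
```

A round-robin needs two judge calls for each of the n(n-1)/2 pairs, so this sketch is only practical for a small candidate pool, which fits the constrained search budgets the abstract mentions.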

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

S1. The relative judgment–guided Top-K search is a well-motivated methodological idea that builds on prior insights about the reliability of pairwise evaluation and applies them in an agentic setting.

S2. The experiments are reasonable, covering two benchmarks, several LLM backbones, and detailed component ablations. The experiments on the contribution of each component, including initial plan diversity, self-improvement, and the comparator strategy, are appreciated.

Weaknesses

While the results and experiments are good, my main concerns center on how much we can interpret from them, and I'm happy to revisit these concerns with some clarification. First, the paper does not report error bars or statistical significance, which makes it hard to assess whether observed performance differences are meaningful or consistent across runs. Second, it is important, especially when we consider deployment, to also compare the cost of each agent/ablation involved. How many extra LLM calls/toke

Reviewer 02 · Rating 6 · Confidence 3

Strengths

(1) The paper is well-motivated and clearly defines the need for principled frameworks for scientific coding tasks with verifiable outputs. (2) It presents a structured search method combining relative judgments and iterative refinement, supported by consistent quantitative improvements over existing agent baselines.

Weaknesses

* Evaluation is limited to two controlled benchmarks, leaving uncertainty about generalization to real-world or open-ended scientific tasks.
* Reliance on LLM-as-judge comparisons may introduce bias, as the same models both generate and evaluate solutions.
* The fixed and narrow search budget restricts exploration, and scalability to more complex tasks remains unclear.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- The empirical result of outperforming OpenHands and Self-Debug is quite compelling
- nice to see search budgets taken into account in the framework
- nice to see you leveraging existing benchmarks rather than creating a new one
- ablation that suggests relative judgements are helping (a little bit)

Weaknesses

- Gains are somewhat modest (~2%-3%), so the impact of the work seems a little limited
- Comparison with genetic algorithm approaches to coding (e.g., in AI Scientist) would be useful

Minor:
- Abstract takes way too long to get to the goal and contribution; these should be stated in the first or second sentence. (The abstract gives the impression at first that you're going to propose a benchmark)
- Would be worth expanding on use of relative judgements in AI, e.g., it's the basis of A/B testing, preferen

Taxonomy

Topics: Machine Learning in Materials Science · Topic Modeling · Multimodal Machine Learning Applications