SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents

Yujiong Shen; Yajie Yang; Zhiheng Xi; Binze Hu; Huayu Sha; Jiazheng Zhang; Qiyuan Peng; Junlin Shang; Jixuan Huang; Yutao Fan; Jingqi Tong; Shihan Dou; Ming Zhang; Lei Bai; Zhenfei Yin; Tao Gui; Xingjun Ma; Qi Zhang; Xuanjing Huang; Yu-Gang Jiang

arXiv:2602.12984·cs.CL·February 16, 2026

TL;DR

SciAgentGym is a benchmark and interactive environment for evaluating and improving how well large language models carry out complex, multi-step scientific tool use across multiple disciplines. It exposes the limitations of current models and proposes a data synthesis method to address them.

Contribution

The paper presents SciAgentGym and SciAgentBench for benchmarking scientific tool-use, and proposes SciForge, a data synthesis method, to enhance models' multi-step scientific reasoning capabilities.

Findings

01. State-of-the-art models struggle with complex scientific workflows.

02. Fine-tuning with logic-aware trajectories improves performance.

03. Success rates drop sharply as interaction horizons lengthen: GPT-5 falls from 60.6% to 30.9%, a drop of 29.7 percentage points.

Abstract

Scientific reasoning inherently demands integrating sophisticated toolkits to navigate domain-specific knowledge. Yet, current benchmarks largely overlook agents' ability to orchestrate tools for such rigorous workflows. To bridge this gap, we introduce SciAgentGym, a scalable interactive environment featuring 1,780 domain-specific tools across four natural science disciplines, supported by a robust execution infrastructure. Complementing this, we present SciAgentBench, a tiered evaluation suite designed to stress-test agentic capabilities from elementary actions to long-horizon workflows. Our evaluation identifies a critical bottleneck: state-of-the-art models struggle with complex scientific tool-use. Even for a leading model like GPT-5, success rates drop sharply from 60.6% to 30.9% as interaction horizons extend, primarily due to failures in multi-step workflow execution. To address…

Code & Models

Datasets

Maxue627/SCILLM-benchmarks
dataset · 154 downloads

Taxonomy

Topics: Scientific Computing and Data Management · Machine Learning in Materials Science · Topic Modeling