Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft
Zhou Ziheng, Huacong Tang, Jinyuan Zhang, Haowei Lin, Bangcheng Yang, Qian Long, Fang Sun, Yizhou Sun, Yitao Liang, Ying Nian Wu, Demetri Terzopoulos, Xiaofeng Gao

TL;DR
This paper introduces SciCrafter, a Minecraft-based benchmark that evaluates whether AI agents can complete the entire discovery-to-application loop, and uses it to show where current models fall short and how their bottlenecks shift.
Contribution
The paper presents SciCrafter, a novel benchmark for assessing AI's discovery-to-application capabilities, and analyzes the performance and bottlenecks of state-of-the-art models.
Findings
All evaluated models plateau at a success rate of about 26%.
Knowledge application remains the largest gap across models.
For frontier models, identifying the right problems becomes a major challenge.
Abstract
Discovering causal regularities and applying them to build functional systems (the discovery-to-application loop) is a hallmark of general intelligence, yet evaluating this capacity has been hindered by the vast complexity gap between scientific discovery and real-world engineering. We introduce SciCrafter, a Minecraft-based benchmark that operationalizes this loop through parameterized redstone circuit tasks. Agents must ignite lamps in specified patterns (e.g., simultaneously or in timed sequences); scaling target parameters substantially increases construction complexity and required knowledge, forcing genuine discovery rather than reliance on memorized solutions. Evaluating frontier models including GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5 under a general-purpose code agent scaffold, we find that all plateau at a success rate of approximately 26%. To diagnose these failures, we…
