Investigating Advanced Reasoning of Large Language Models via Black-Box Environment Interaction

Congchi Yin; Tianyi Wu; Yankai Shu; Alex Gu; Yunhan Wang; Jun Shao; Xun Jiang; Piji Li

arXiv:2508.19035 · cs.AI · May 7, 2026

TL;DR

This paper introduces a new evaluation paradigm called black-box environment interaction to assess the integrated reasoning abilities of large language models (LLMs) in unknown environments, highlighting their strengths and limitations.

Contribution

The paper proposes the Oracle benchmark, which comprises 6 types of black-box tasks across 96 black-box environments, to evaluate LLM reasoning in interactive settings, and it analyzes the performance of 19 LLMs and the challenges they face.

Findings

01

o3, a leading LLM, achieves over 70% accuracy on easy black-box tasks.

02

LLMs struggle with complex black-box tasks, with performance below 40%.

03

High-level planning is a universal challenge for LLMs in these tasks.

Abstract

Existing tasks fall short in evaluating the reasoning ability of Large Language Models (LLMs) in an interactive, unknown environment. This deficiency leads to the isolated assessment of deductive, inductive, and abductive reasoning, neglecting the integrated reasoning process that is indispensable for human-like discovery learning. We introduce a novel evaluation paradigm, black-box environment interaction, to tackle this challenge. A black-box environment is defined by a hidden function that maps a specific set of inputs to outputs. LLMs are required to unravel the hidden function behind the black-box environment by interacting with it within a given number of exploration turns and reasoning over the observed input-output pairs. Leveraging this idea, we build the Oracle benchmark, which comprises 6 types of black-box tasks with 96 black-box environments. 19 modern LLMs are benchmarked. o3, a…
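The interaction loop described in the abstract can be illustrated with a minimal sketch. This is not the Oracle benchmark's code or API: the names `BlackBoxEnv`, `explore_and_guess`, and `evaluate` are hypothetical, the hidden function is a toy affine map, and a simple rule-based solver stands in for the LLM. It only shows the general shape of the setup: a hidden function, a fixed budget of exploration turns, and a final check of the recovered function on held-out inputs.

```python
from typing import Callable, List, Tuple


class BlackBoxEnv:
    """Hypothetical black-box environment wrapping a hidden function."""

    def __init__(self, hidden_fn: Callable[[int], int], max_turns: int):
        self._hidden_fn = hidden_fn  # never shown to the solver
        self.max_turns = max_turns   # exploration budget
        self.history: List[Tuple[int, int]] = []  # observed (input, output) pairs

    def query(self, x: int) -> int:
        """Spend one exploration turn to observe the output for input x."""
        if len(self.history) >= self.max_turns:
            raise RuntimeError("exploration budget exhausted")
        y = self._hidden_fn(x)
        self.history.append((x, y))
        return y


def explore_and_guess(env: BlackBoxEnv) -> Callable[[int], int]:
    """Toy solver (stand-in for an LLM). Assumes f(x) = a*x + b."""
    # Induce a hypothesis from two observations.
    y0 = env.query(0)
    y1 = env.query(1)
    a, b = y1 - y0, y0
    # Use the remaining turns to test the hypothesis on new inputs.
    probe = 2
    while len(env.history) < env.max_turns:
        if env.query(probe) != a * probe + b:
            raise ValueError("affine hypothesis refuted by observation")
        probe += 1
    return lambda x: a * x + b


def evaluate(guess: Callable[[int], int],
             hidden_fn: Callable[[int], int],
             test_inputs: List[int]) -> float:
    """Fraction of held-out inputs on which the guess matches the hidden function."""
    hits = sum(guess(x) == hidden_fn(x) for x in test_inputs)
    return hits / len(test_inputs)


if __name__ == "__main__":
    hidden = lambda x: 3 * x + 7  # assumed toy hidden function
    env = BlackBoxEnv(hidden, max_turns=5)
    guess = explore_and_guess(env)
    print(evaluate(guess, hidden, list(range(10, 20))))  # prints 1.0
```

The solver here hard-codes one hypothesis class, which is exactly what a real black-box task would not allow; the point of the paradigm is that the model must decide for itself which inputs to probe and which hypotheses to form and revise within the turn budget.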

Code & Models

Repositories

lemonsis/Oracle_Benchmark (GitHub)
