TL;DR
This paper introduces a new evaluation paradigm called black-box environment interaction to assess the integrated reasoning abilities of large language models (LLMs) in unknown environments, highlighting their strengths and limitations.
Contribution
The paper proposes the Oracle benchmark, a diverse set of black-box tasks for evaluating LLM reasoning in interactive settings, and analyzes how the models perform and where they run into difficulty.
Findings
o3, a leading LLM, achieves over 70% accuracy on easy black-box tasks.
LLMs struggle with complex black-box tasks, with performance below 40%.
High-level planning is a universal challenge for LLMs in these tasks.
Abstract
Existing tasks fall short in evaluating the reasoning ability of Large Language Models (LLMs) in an interactive, unknown environment. This deficiency leads to the isolated assessment of deductive, inductive, and abductive reasoning, neglecting the integrated reasoning process that is indispensable for human-like discovery learning. We introduce a novel evaluation paradigm, black-box environment interaction, to tackle this challenge. A black-box environment is defined by a hidden function that maps a specific set of inputs to outputs. LLMs are required to unravel the hidden function behind the black-box environment by interacting with it within a given number of exploration turns and reasoning over the observed input-output pairs. Leveraging this idea, we build the Oracle benchmark, which comprises 6 types of black-box tasks with 96 black-box environments. 19 modern LLMs are benchmarked. o3, a…
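The interaction loop described in the abstract can be illustrated with a minimal Python sketch. This is not the Oracle benchmark's code or API: the class `BlackBoxEnvironment`, the helpers `fit_linear` and `accuracy`, and the linear hidden function are all hypothetical names chosen for illustration, and the simple two-point solver stands in for the LLM's reasoning.

```python
from typing import Callable, List, Tuple


class BlackBoxEnvironment:
    """Hypothetical wrapper: hides a function and caps the number of queries."""

    def __init__(self, hidden_fn: Callable[[int], int], max_turns: int) -> None:
        self._hidden_fn = hidden_fn
        self.max_turns = max_turns
        self.history: List[Tuple[int, int]] = []  # observed (input, output) pairs

    def query(self, x: int) -> int:
        """Spend one exploration turn to observe the output for input x."""
        if len(self.history) >= self.max_turns:
            raise RuntimeError("exploration budget exhausted")
        y = self._hidden_fn(x)
        self.history.append((x, y))
        return y


def fit_linear(pairs: List[Tuple[int, int]]) -> Callable[[int], int]:
    """Toy stand-in for the solver: assumes y = a*x + b with integer a, b,
    and recovers them from the first two observations (distinct inputs)."""
    (x0, y0), (x1, y1) = pairs[0], pairs[1]
    a = (y1 - y0) // (x1 - x0)
    b = y0 - a * x0
    return lambda x: a * x + b


def accuracy(hypothesis: Callable[[int], int],
             hidden_fn: Callable[[int], int],
             test_inputs) -> float:
    """Fraction of held-out inputs on which the hypothesis matches the hidden function."""
    inputs = list(test_inputs)
    correct = sum(hypothesis(x) == hidden_fn(x) for x in inputs)
    return correct / len(inputs)


# Assumed hidden function for the demo; a real task would keep this secret.
hidden = lambda x: 3 * x + 2

env = BlackBoxEnvironment(hidden, max_turns=2)
env.query(0)  # exploration turn 1
env.query(1)  # exploration turn 2

hypothesis = fit_linear(env.history)
score = accuracy(hypothesis, hidden, range(-5, 6))
print(f"accuracy on held-out inputs: {score:.2f}")
```

The turn cap is the design point worth noting: because queries are limited, the solver has to choose informative inputs rather than enumerate the input space, which is one way to read the planning challenge listed in the findings.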
