OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction

Skyler Hallinan; Thejas Venkatesh; Xiang Ren; Sai Praneeth Karimireddy; Ashwin Paranjape; Yuhao Zhang; Jack Hessel

arXiv:2602.15197·cs.CL·February 18, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces OpaqueToolsBench, a benchmark for evaluating LLMs in environments with opaque tools, and proposes ToolObserver, a method that improves tool documentation through interaction feedback, enhancing performance and efficiency.

Contribution

The paper presents OpaqueToolsBench for testing LLMs with opaque tools and introduces ToolObserver, a novel framework for refining tool documentation via interaction feedback.
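The idea of refining tool documentation from interaction feedback can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the authors' ToolObserver implementation: `opaque_search`, `reflect`, and `refine_documentation` are hypothetical names, the tool's hidden failure mode is invented for illustration, and the reflection step is a simple rule standing in for what would presumably be an LLM call.

```python
from typing import Callable, List, Tuple

Observation = Tuple[str, str]  # (query sent to the tool, raw tool output)


def opaque_search(query: str) -> str:
    """Hypothetical opaque tool: fails on queries longer than 3 words,
    a limit its documentation never mentions."""
    if len(query.split()) > 3:
        return "ERROR"
    return f"results for {query!r}"


def reflect(doc: str, observations: List[Observation]) -> str:
    """Rule-based stand-in for a reflection step (assumption: a real
    system would ask an LLM to summarize the observations instead)."""
    failed = [q for q, out in observations if out == "ERROR"]
    succeeded = [q for q, out in observations if out != "ERROR"]
    if failed and succeeded:
        limit = max(len(q.split()) for q in succeeded)
        note = f" Observed: queries of up to {limit} words succeed; longer ones failed."
        if note not in doc:  # avoid appending the same note twice
            doc += note
    return doc


def refine_documentation(
    tool: Callable[[str], str], doc: str, probes: List[str], rounds: int = 2
) -> str:
    """Alternate between exploring the tool and updating its documentation."""
    observations: List[Observation] = []
    for _ in range(rounds):
        for query in probes:
            observations.append((query, tool(query)))
        doc = reflect(doc, observations)
    return doc


initial_doc = "search(query): searches."
probes = ["chess openings", "best chess openings for beginners", "tool use"]
final_doc = refine_documentation(opaque_search, initial_doc, probes)
print(final_doc)
```

The refined documentation records a behavior that the original one-line description left out, which is the kind of information an agent would otherwise have to rediscover at test time.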

Findings

01

ToolObserver outperforms existing documentation methods on OpaqueToolsBench.

02

It achieves 3.5-7.5x token savings during test-time exploration.

03

Existing automatic documentation methods are costly and unreliable for opaque tools.

Abstract

Tool-calling is essential for Large Language Model (LLM) agents to complete real-world tasks. While most existing benchmarks assume simple, perfectly documented tools, real-world tools (e.g., general "search" APIs) are often opaque, lacking clear best practices or failure modes. Can LLM agents improve their performance in environments with opaque tools by interacting and subsequently improving documentation? To study this, we create OpaqueToolsBench, a benchmark consisting of three distinct task-oriented environments: general function calling, interactive chess playing, and long-trajectory agentic search. Each environment provides underspecified tools that models must learn to use effectively to complete the task. Results on OpaqueToolsBench suggest existing methods for automatically documenting tools are expensive and unreliable when tools are opaque. To address this, we propose a…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The concept of opaque tool invocation is novel and opens an interesting direction for further research.
2. The experiments are extensive and well-aligned with the proposed idea.
3. The paper presents a clear analysis, demonstrating how performance evolves across iterations and documentation levels.

Weaknesses

1. The exploration and reflection phases in offline mode lack sufficient detail: the paper describes them only at a high level, without algorithmic or implementation specifics.
2. ToolObserver offers limited novelty and resembles prior reflection-based methods.
3. The three benchmark scenarios may not fully represent real-world tool use.
4. Benchmark performance remains low. Even with optimization, the best reported results are still modest, which raises concerns about

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- The experimental results are encouraging, showing that the proposed method can generate good tool documentation that helps improve model performance.
- Compared to the baseline method, the proposed strategy does not have to process all tools. As a result, it saves a significant number of tokens when generating tool documentation.

Weaknesses

I feel the given situation is over-complicated / not well motivated. There are several reasons that tools won't be opaque in most applications:
1. Agent developers are trying their best to improve performance, and giving the agent good tool documentation is among the easiest improvements they can make.
2. Tool developers want to maximize the chance that their tool gets called, so they will work on improving the tool documentation to make it easy for models to understand.
3. the best us

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The paper is well-organized and easy to follow, with clear presentation of benchmark statistics and evaluation metrics.
- The data generation pipeline is simple, scalable, and comprehensively described.
- The simplicity and token efficiency of TOOLOBSERVER make it directly applicable to real-world tool-use scenarios.

Weaknesses

While three domains are tested, the evaluation metrics are somewhat limited. For example, Tables 2 and 3 only report overall accuracy. It would be informative to include additional metrics such as parameter accuracy or Abstract Syntax Tree (AST) accuracy, which measures the format of the generated function call.
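To make the reviewer's suggestion concrete, the sketch below shows one way an AST-based check could separate function-name accuracy from parameter accuracy. It is an illustration only, not a metric used in the paper: `parse_call` and `call_match` are hypothetical helpers, calls are assumed to be written in Python syntax with literal arguments, and `ast.unparse` requires Python 3.9 or later.

```python
import ast
from typing import Any, Dict, List, Tuple


def parse_call(src: str) -> Tuple[str, List[Any], Dict[str, Any]]:
    """Parse a single call expression into (name, positional args, keyword args)."""
    node = ast.parse(src, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("not a function call")
    name = ast.unparse(node.func)
    # literal_eval raises ValueError on non-literal arguments (e.g. variables)
    args = [ast.literal_eval(a) for a in node.args]
    kwargs = {k.arg: ast.literal_eval(k.value) for k in node.keywords}
    return name, args, kwargs


def call_match(pred: str, gold: str) -> Dict[str, bool]:
    """Compare a predicted call against a reference call.

    Keyword-argument order is ignored; a prediction that does not
    parse as a call counts as a miss on both criteria.
    """
    try:
        p_name, p_args, p_kwargs = parse_call(pred)
    except (SyntaxError, ValueError):
        return {"name": False, "params": False}
    g_name, g_args, g_kwargs = parse_call(gold)
    return {
        "name": p_name == g_name,
        "params": p_args == g_args and p_kwargs == g_kwargs,
    }


gold = 'search(query="chess", top_k=5)'
reordered = call_match('search(top_k=5, query="chess")', gold)
wrong_value = call_match('search(query="chess", top_k=3)', gold)
malformed = call_match('search(query=', gold)
print(reordered, wrong_value, malformed)
```

Reporting the two flags separately would show whether a model picks the right tool but fills in its arguments incorrectly, which overall accuracy alone cannot distinguish.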

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education