ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark

Vaskar Nath; Pranav Raja; Claire Yoon; Sean Hendryx

arXiv:2501.01290·cs.CL·January 3, 2025

TL;DR

ToolComp is a new benchmark for evaluating multi-step reasoning with multiple tools, emphasizing correctness of intermediate steps and final answers, and demonstrating the importance of process supervision in improving AI performance.

Contribution

The paper introduces ToolComp, a comprehensive benchmark with human-verified prompts and process labels, and shows that process-supervised reward models outperform outcome-supervised ones in complex reasoning tasks.

Findings

01

Most models achieve less than 50% accuracy on ToolComp.

02

Process-supervised reward models outperform outcome-supervised models.

03

PRMs improve rank@1 accuracy over ORMs by 19% and 11% when ranking base and fine-tuned model trajectories, respectively.
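
To make the ranking comparison concrete, here is a minimal sketch of how rank@1 accuracy could be computed for a process-supervised reward model (PRM) and an outcome-supervised reward model (ORM). The trajectory layout (`step_scores`, `final_score`, `correct`), the use of `min` to aggregate step scores, and the toy numbers are all assumptions made for illustration; they are not taken from the paper.

```python
from typing import Callable, Dict, List

# Hypothetical trajectory record (not the paper's data format):
#   step_scores: per-step reward-model scores (PRM view)
#   final_score: a single score for the whole trajectory (ORM view)
#   correct:     whether the trajectory's final answer is right
Trajectory = Dict[str, object]


def prm_score(traj: Trajectory) -> float:
    # Assumption: aggregate step scores with min, so one bad step
    # sinks the whole trajectory. Other aggregations are possible.
    return min(traj["step_scores"])


def orm_score(traj: Trajectory) -> float:
    # The ORM view only sees one score for the full trajectory.
    return traj["final_score"]


def rank_at_1(groups: List[List[Trajectory]],
              score_fn: Callable[[Trajectory], float]) -> float:
    """Fraction of prompts whose top-scored candidate is correct."""
    if not groups:
        raise ValueError("need at least one group of candidates")
    hits = 0
    for candidates in groups:
        best = max(candidates, key=score_fn)
        hits += int(bool(best["correct"]))
    return hits / len(groups)


# Toy data: two prompts, two candidate trajectories each.
groups = [
    [
        {"step_scores": [0.9, 0.8, 0.9], "final_score": 0.6, "correct": True},
        {"step_scores": [0.9, 0.2, 0.9], "final_score": 0.8, "correct": False},
    ],
    [
        {"step_scores": [0.7, 0.6], "final_score": 0.9, "correct": True},
        {"step_scores": [0.4, 0.9], "final_score": 0.3, "correct": False},
    ],
]

prm_acc = rank_at_1(groups, prm_score)
orm_acc = rank_at_1(groups, orm_score)
print(f"PRM rank@1: {prm_acc:.2f}, ORM rank@1: {orm_acc:.2f}")
```

In this toy setup the step-level scorer rejects the first prompt's flawed trajectory because of its weak middle step, while the outcome-level scorer prefers it. That is the kind of gap a rank@1 comparison is meant to surface.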

Abstract

Despite recent advances in AI, the development of systems capable of executing complex, multi-step reasoning tasks involving multiple tools remains a significant challenge. Current benchmarks fall short in capturing the real-world complexity of tool-use reasoning, where verifying the correctness of not only the final answer but also the intermediate steps is important for evaluation, development, and identifying failures during inference time. To bridge this gap, we introduce ToolComp, a comprehensive benchmark designed to evaluate multi-step tool-use reasoning. ToolComp is developed through a collaboration between models and human annotators, featuring human-edited/verified prompts, final answers, and process supervision labels, allowing for the evaluation of both final outcomes and intermediate reasoning. Evaluation across six different model families demonstrates the challenging…

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

> Tool use is an important area, and more dataset/benchmark contributions in this area are useful. Step-level supervision/verification is also proving to be a promising direction, making this contribution pretty timely.
> The empirical comparisons in this paper with existing IT checkpoints in zero-shot evals are pretty extensive and thorough.

Weaknesses

> The contributed dataset is relatively small. With just 485 prompts and 1,731 per-step annotations, it is hard to justify using this dataset for large-scale model training; pragmatically, it can only be used as a benchmark.
> The experiments in Section 5 lack sufficient detail. I had to go into the details of the experiments in the Appendix to understand Section 5 better.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The motivation for this paper is strong, addressing a gap in the field by providing a dataset that supervises the process of tool usage rather than just the final outcome, allowing for accurate evaluation of multi-step tool-use reasoning.
2. The method of generating queries and answers combines LLM outputs with human validation and careful inspection of each instance in the dataset, ensuring high efficiency.
3. For dataset evaluation, the authors implement LLM grading to avoid the limitations

Weaknesses

1. The evaluation is limited to general LLMs, without testing LLMs specifically fine-tuned for tool use. Including specialized tool-use LLMs could yield deeper insights into model performance in this particular task domain and highlight differences in tool-specific reasoning abilities.
2. Figure 8 (line 1206) reveals that among the 11 tools included, only four are frequently used, while the rest appear in very few instances. This usage pattern, particularly in the ToolComp-Chat subset, may intro

Reviewer 03 · Rating 8 · Confidence 3

Strengths

1. The benchmark introduced in this paper, ToolComp, is novel, sound and solid.
   - Novelty: The authors benchmark LLM-as-a-judge on intermediate steps in multi-step tool use, which is new (cf. Table 1).
   - Soundness: 1) The way they evaluate LLMs on multi-step tool use is sound, and they also incorporate 95% CIs, which makes the results more convincing. 2) The setup of Chat and Enterprise is reasonable. 3) The LLM-as-a-judge evaluation is sound and serves as an important benchmark for both LLM evaluation an

Weaknesses

1. The authors state that ToolComp is complex. However, Table 1 shows only that it has fewer tools. It would be more convincing if the authors reported quantitative comparisons with other benchmarks, e.g., accuracy of LLMs, or elaborated on why the other tools are not included.
2. As the authors mentioned in Table 2, the Llamas should be used with constrained decoding to guarantee valid outputs. It is unclear whether constr

Taxonomy

Topics: Business Process Modeling and Analysis · Service-Oriented Architecture and Web Services

Methods: Balanced Selection