ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark
Vaskar Nath, Pranav Raja, Claire Yoon, Sean Hendryx

TL;DR
ToolComp is a new benchmark for evaluating multi-step reasoning with multiple tools, emphasizing correctness of intermediate steps and final answers, and demonstrating the importance of process supervision in improving AI performance.
Contribution
The paper introduces ToolComp, a comprehensive benchmark with human-verified prompts and process labels, and shows that process-supervised reward models outperform outcome-supervised ones in complex reasoning tasks.
Findings
Most models achieve less than 50% accuracy on ToolComp.
Process-supervised reward models outperform outcome-supervised models.
PRMs improve rank@1 accuracy over ORMs by 19% when ranking base-model trajectories and by 11% when ranking fine-tuned-model trajectories.
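The ranking comparison in the findings above can be pictured with a small sketch. It assumes that "ranking accuracy" means top-1 selection: a reward model scores every candidate trajectory for a prompt, and the prompt counts as a hit if the highest-scored trajectory ends in a correct answer. The trajectory representation, the `top1_accuracy` and `prm_score` helpers, the min-over-steps aggregation, and the toy scorers are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Sequence

# Hypothetical representation: a trajectory is a list of step strings plus
# a flag saying whether its final answer is correct.
Trajectory = tuple[list[str], bool]


def top1_accuracy(
    prompts: Sequence[Sequence[Trajectory]],
    score: Callable[[list[str]], float],
) -> float:
    """Fraction of prompts whose highest-scored trajectory is correct."""
    hits = 0
    for candidates in prompts:
        best = max(candidates, key=lambda t: score(t[0]))
        hits += best[1]
    return hits / len(prompts)


def prm_score(step_scorer: Callable[[str], float]) -> Callable[[list[str]], float]:
    """Build a trajectory scorer from a per-step scorer.

    Aggregating with the minimum step score is an assumption made for
    this sketch, not the paper's stated method.
    """
    return lambda steps: min(step_scorer(s) for s in steps)


# Toy data, invented for illustration: two prompts, two candidates each.
toy = [
    [(["search ok", "calc ok"], True), (["search ok", "calc bad"], False)],
    [(["lookup bad", "calc ok"], False), (["lookup ok", "calc ok"], True)],
]

# Toy process-level scorer: penalises any step marked "bad".
step_scorer = lambda step: 0.1 if "bad" in step else 0.9
# Toy outcome-level scorer: sees only trajectory length, so it cannot
# separate the candidates and falls back to the first one.
orm_scorer = lambda steps: float(len(steps))

prm_acc = top1_accuracy(toy, prm_score(step_scorer))
orm_acc = top1_accuracy(toy, orm_scorer)
print(f"PRM-style top-1 accuracy: {prm_acc:.2f}")
print(f"ORM-style top-1 accuracy: {orm_acc:.2f}")
```

The toy scorers are rigged to make the contrast visible; real reward models would be learned, and the gap between them is an empirical question that the benchmark is meant to measure.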
Abstract
Despite recent advances in AI, the development of systems capable of executing complex, multi-step reasoning tasks involving multiple tools remains a significant challenge. Current benchmarks fall short in capturing the real-world complexity of tool-use reasoning, where verifying the correctness of not only the final answer but also the intermediate steps is important for evaluation, development, and identifying failures during inference time. To bridge this gap, we introduce ToolComp, a comprehensive benchmark designed to evaluate multi-step tool-use reasoning. ToolComp is developed through a collaboration between models and human annotators, featuring human-edited/verified prompts, final answers, and process supervision labels, allowing for the evaluation of both final outcomes and intermediate reasoning. Evaluation across six different model families demonstrates the challenging…
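As a rough illustration of what evaluating both final outcomes and intermediate reasoning could look like, the sketch below stores a verified final answer and a per-step correctness label for each example, then computes one outcome-level and one process-level score. The `Step` and `Example` classes, their field names, and the exact-match comparison are assumptions made for this sketch; they do not describe ToolComp's actual schema or grading procedure.

```python
from dataclasses import dataclass


@dataclass
class Step:
    text: str          # hypothetical: the tool call or reasoning for this step
    is_correct: bool   # process-supervision label for this step


@dataclass
class Example:
    prompt: str
    steps: list[Step]
    final_answer: str  # verified final answer


def final_answer_accuracy(examples: list[Example], predictions: list[str]) -> float:
    """Outcome-level score: share of predictions matching the verified answer.

    Case-insensitive exact match is a stand-in chosen for simplicity.
    """
    matches = sum(
        pred.strip().lower() == ex.final_answer.strip().lower()
        for ex, pred in zip(examples, predictions)
    )
    return matches / len(examples)


def step_label_accuracy(examples: list[Example], judged: list[list[bool]]) -> float:
    """Process-level score: share of steps where a judge agrees with the label."""
    pairs = [
        (step.is_correct, verdict)
        for ex, verdicts in zip(examples, judged)
        for step, verdict in zip(ex.steps, verdicts)
    ]
    return sum(label == verdict for label, verdict in pairs) / len(pairs)


# Toy data, invented for illustration only.
examples = [
    Example("toy prompt A", [Step("search", True), Step("calculate", False)], "42"),
    Example("toy prompt B", [Step("lookup", True)], "Paris"),
]
outcome_acc = final_answer_accuracy(examples, ["42", "London"])
process_acc = step_label_accuracy(examples, [[True, True], [True]])
print(f"final-answer accuracy: {outcome_acc:.2f}")
print(f"step-label accuracy:   {process_acc:.2f}")
```

Keeping the two scores separate reflects the abstract's point that a trajectory can be judged on its steps as well as on its final answer.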
Peer Reviews
Decision: Submitted to ICLR 2025
> Tool use is an important area, and having more dataset/benchmark contributions in this area is useful. The use of step-level supervision/verification is also proving to be a promising area, making this contribution pretty timely.
> The empirical comparisons in this paper with existing IT checkpoints in zero-shot evals are pretty extensive and thorough.
> The contributed dataset is relatively small. With just 485 prompts and 1731 per-step annotations, it is hard to justify using this dataset for large-scale model training; pragmatically, it can only be used as a benchmark.
> The experiments in Section 5 lack sufficient details. I had to go into the details of the experiments in the Appendix to understand Section 5 better.
1. The motivation for this paper is strong, addressing a gap in the field by providing a dataset that supervises the process of tool usage rather than just the final outcome, allowing for accurate evaluation of multi-step tool-use reasoning.
2. The method of generating queries and answers combines LLM outputs with human validation and careful inspection of each instance in the dataset, ensuring high efficiency.
3. For dataset evaluation, the authors implement LLM grading to avoid the limitations
1. The evaluation is limited to general LLMs, without testing LLMs specifically fine-tuned for tool use. Including specialized tool-use LLMs could yield deeper insights into model performance in this particular task domain and highlight differences in tool-specific reasoning abilities.
2. Figure 8 (line 1206) reveals that among the 11 tools included, only four are frequently used, while the rest appear in very few instances. This usage pattern, particularly in the ToolComp-Chat subset, may intro
1. The benchmark introduced in this paper, i.e. ToolComp, is novel, sound and solid.
   - Novelty: They novelly benchmark LLM-as-a-judge on intermediate steps in multi-step tool use. (Cf. Table 1)
   - Soundness: 1) The way they evaluate LLMs on multi-step tool use is sound and they also incorporate 95% CIs, which makes the results more convincing. 2) The setup of Chat and Enterprise is reasonable. 3) The LLM-as-a-judge evaluation is sound and serves as an important benchmark for both LLM evaluation an
1. The authors mention that ToolComp is complex. However, Table 1 only shows that it has fewer tools. It would be more convincing if the authors reported quantitative comparisons with other benchmarks, e.g., the accuracy of LLMs, or elaborated on why the other tools are not included.
2. As the authors mention in Table 2, the Llamas should be used with constrained decoding to guarantee valid outputs. It is unclear whether constr
Taxonomy
Topics: Business Process Modeling and Analysis · Service-Oriented Architecture and Web Services
Methods: Balanced Selection
