ML-Tool-Bench: Tool-Augmented Planning for ML Tasks
Yaswanth Chittepu, Raghavendra Addanki, Tung Mai, Anup Rao, Branislav Kveton

TL;DR
This paper introduces a comprehensive benchmark for evaluating tool-augmented ML agents capable of complex end-to-end data science workflows, highlighting current limitations and proposing effective solutions to improve planning and execution.
Contribution
The work presents a new benchmark with 61 tools and 15 Kaggle challenges, and proposes two methods that significantly enhance the planning capabilities of ML agents.
Findings
Standard approaches struggle with complex ML pipelines.
Tree search with LLM evaluation underperforms due to inconsistent scoring.
Proposed methods improve performance by over 16 percentile positions.
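The last finding is stated in percentile positions. As a rough illustration of how such a gain could be computed, the sketch below assumes a percentile position is the share of leaderboard entries that a score matches or beats. The paper's exact definition may differ, and the function name `percentile_position` and the leaderboard values are hypothetical, made up for this example.

```python
from bisect import bisect_right


def percentile_position(agent_score, leaderboard_scores, higher_is_better=True):
    """Percent of leaderboard entries that the agent's score matches or beats.

    Assumed definition for illustration only; not taken from the paper.
    """
    if not leaderboard_scores:
        raise ValueError("leaderboard_scores must be non-empty")
    if not higher_is_better:
        # Flip signs so that "larger is better" holds for error-style metrics.
        agent_score = -agent_score
        leaderboard_scores = [-s for s in leaderboard_scores]
    ranked = sorted(leaderboard_scores)
    beaten_or_tied = bisect_right(ranked, agent_score)
    return 100.0 * beaten_or_tied / len(ranked)


# Made-up leaderboard and scores, not results from the benchmark.
leaderboard = [0.70, 0.75, 0.80, 0.85, 0.90]
baseline = percentile_position(0.76, leaderboard)
improved = percentile_position(0.86, leaderboard)
gain = improved - baseline  # difference in percentile positions
```

Reporting a difference of percentiles rather than raw metric values puts tasks with different metrics (accuracy, RMSE, and so on) on a common scale, which is presumably why a leaderboard-relative measure is used.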
Abstract
The development of autonomous machine learning (ML) agents capable of end-to-end data science workflows represents a significant frontier in artificial intelligence. These agents must orchestrate complex sequences of data analysis, feature engineering, model selection, and hyperparameter optimization, tasks that require sophisticated planning and iteration. While recent work on building ML agents has explored using large language models (LLMs) for direct code generation, tool-augmented approaches offer greater modularity and reliability. However, existing tool-use benchmarks focus primarily on task-specific tool selection or argument extraction for tool invocation, failing to evaluate the sophisticated planning capabilities required for ML agents. In this work, we introduce a comprehensive benchmark for evaluating tool-augmented ML agents using a curated set of 61 specialized tools and…
Peer Reviews
Decision · Submitted to ICLR 2026
1. The task is very relevant. Real ML engineering requires multi-step planning, reasoning, and execution, and the benchmark targets this long-horizon setting with artifact reuse. The scratchpad design directly addresses state corruption issues in multi-step pipelines. 2. The setup of representing ML steps as tools reduces emphasis on raw coding and focuses evaluation on the structure of the ML pipeline. The toolset is clearly scoped across the main stages of tabular ML. 3. The experiments
1. Benchmark size - With only 15 tasks the variance is high across problems, which makes aggregate comparisons unstable. The tables show very large swings in percentile across tasks, including cases with near 0 and near 100 for the same methods on different tasks. I would suggest expanding to at least 30–50 tasks per task family to produce a more reliable signal, and claims should be scoped to tabular ML for now. 2. Data interaction clarity - Many tools target code-level operations like loading
1. Extending tool-use evaluation to end-to-end ML workflows with long-horizon planning and artifact management. 2. The proposed approach includes multiple complementary features. 3. Results show consistent improvement across different models and metrics.
1. Manual subtask or tool assignment could bias results. Hierarchical MCTS depends on hand-assigning tools to subtasks, which may imply prior knowledge and limit the generality and automation of the method. 2. The scope of evaluation is limited to 15 tabular data challenges. This limits the benchmark's breadth in evaluating a wider range of ML tasks. Furthermore, the benchmark’s 61 tools do not adequately demonstrate scalability to large action spaces. 3. Novelty is insufficient. The method relie
- The paper addresses an underexplored aspect of LLM-for-ML research—tool reasoning—bridging the gap between abstract workflow generation and real tool invocation. - The dataset is carefully constructed, featuring verified tool APIs and multi-step plans that make evaluation more interpretable and diagnostic than previous black-box setups.
- While the benchmark is well-motivated, its methodological novelty remains limited—it mainly reorganizes existing task formulations and evaluation paradigms into a cleaner schema, without introducing new modeling or learning components. - As a benchmark paper, the overall scale feels somewhat limited: both the dataset size and the number of baselines are modest, and the work would be more convincing if it compared tool-based reasoning with direct code-generation approaches (e.g., AutoML or ML-a
Taxonomy
Topics: Machine Learning in Materials Science · Scientific Computing and Data Management · Machine Learning and Data Classification
