TML-Bench: Benchmark for Data Science Agents on Tabular ML Tasks

Mykola Pinchuk

arXiv:2603.05764·cs.LG·March 9, 2026

TL;DR

TML-Bench is a benchmark that evaluates data science agents on Kaggle-style tabular tasks, emphasizing end-to-end correctness, reliability, and how performance scales across three time budgets (240s, 600s, and 1200s).

Contribution

This work presents TML-Bench, a novel benchmark for assessing the effectiveness of autonomous data science agents on tabular machine learning tasks.

Findings

01

MiniMax-M2.1 achieves the best aggregate performance score on all four competitions under the paper's primary aggregation.

02

Average performance improves with larger time budgets.

03

Scaling results are noisy for some models at current run counts.

Abstract

Autonomous coding agents can produce strong tabular baselines quickly on Kaggle-style tasks. Their practical value depends on end-to-end correctness and reliability under time limits. This paper introduces TML-Bench, a tabular benchmark for data science agents on Kaggle-style tasks, and evaluates 10 OSS LLMs on four Kaggle competitions under three time budgets (240s, 600s, and 1200s). Each model is run five times per task and budget. A run is successful if it produces a valid submission and a private-holdout score on hidden labels that are not accessible to the agent. The paper reports median performance, success rates, and run-to-run variability. The MiniMax-M2.1 model achieves the best aggregate performance score on all four competitions under the paper's primary aggregation. Average performance improves with larger time budgets. Scaling is noisy for some individual models at the current…
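The protocol in the abstract (models × competitions × time budgets × five runs, with a per-run success criterion) can be illustrated with a small Python sketch. This is not the paper's code: the names `Run`, `summarize`, and `expected_run_count` are hypothetical, a failed run is represented here as a missing score, and the max-minus-min range is only one assumed stand-in for "run-to-run variability", which the abstract does not define.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional

# Values taken from the abstract.
BUDGETS_S = (240, 600, 1200)
RUNS_PER_CELL = 5


@dataclass
class Run:
    """One agent run (hypothetical record layout)."""
    model: str
    task: str
    budget_s: int
    # Private-holdout score; None means no valid submission was produced.
    score: Optional[float]


def expected_run_count(n_models: int, n_tasks: int) -> int:
    """Total runs implied by the design: models x tasks x budgets x repeats."""
    return n_models * n_tasks * len(BUDGETS_S) * RUNS_PER_CELL


def summarize(runs):
    """Group runs by (model, task, budget) and compute per-cell statistics."""
    cells = {}
    for r in runs:
        cells.setdefault((r.model, r.task, r.budget_s), []).append(r.score)

    summary = {}
    for key, scores in cells.items():
        ok = [s for s in scores if s is not None]  # successful runs only
        summary[key] = {
            "success_rate": len(ok) / len(scores),
            "median": median(ok) if ok else None,
            # Assumed variability measure: range over successful runs.
            "spread": (max(ok) - min(ok)) if ok else None,
        }
    return summary


if __name__ == "__main__":
    # Made-up scores for one cell, purely for illustration.
    demo = [Run("model-a", "task-1", 240, s)
            for s in (0.80, 0.82, None, 0.81, 0.85)]
    print(expected_run_count(n_models=10, n_tasks=4))
    print(summarize(demo))
```

With the figures stated in the abstract (10 models, four competitions, three budgets, five runs each), `expected_run_count` gives 600 runs in total. Computing the median over successful runs only is a design choice of this sketch; a benchmark could equally penalize failures in the aggregate.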

Taxonomy

Topics: Machine Learning and Data Classification · Mobile Crowdsensing and Crowdsourcing · Machine Learning and Algorithms