TimeSeriesGym: A Scalable Benchmark for (Time Series) Machine Learning Engineering Agents
Yifu Cai, Xinyu Li, Mononito Goswami, Michał Wiliński, Gus Welter, Artur Dubrawski

TL;DR
TimeSeriesGym is a scalable benchmarking framework designed to evaluate AI agents across diverse time series machine learning engineering challenges, incorporating multiple skills, data sources, and evaluation methods to better reflect real-world ML engineering tasks.
Contribution
We introduce TimeSeriesGym, a comprehensive, scalable benchmark that evaluates AI agents on diverse skills and artifacts, addressing limitations of existing narrow and non-scalable benchmarks.
Findings
Supports evaluation of multiple research artifacts including code and models.
Balances objective and contextual evaluation methods.
Extensible to other data modalities beyond time series.
Abstract
We introduce TimeSeriesGym, a scalable benchmarking framework for evaluating Artificial Intelligence (AI) agents on time series machine learning engineering challenges. Existing benchmarks lack scalability, focus narrowly on model building in well-defined settings, and evaluate only a limited set of research artifacts (e.g., CSV submission files). To make AI agent benchmarking more relevant to the practice of machine learning engineering, our framework scales along two critical dimensions. First, recognizing that effective ML engineering requires a range of diverse skills, TimeSeriesGym incorporates challenges from diverse sources spanning multiple domains and tasks. We design challenges to evaluate both isolated capabilities (including data handling, understanding research repositories, and code translation) and their combinations, and rather than addressing each challenge…
Peer Reviews
Decision · Submitted to ICLR 2026
- Identifies a Clear and Important Gap: The paper is correct that time series is an underserved domain in agentic benchmarking. Creating a dedicated benchmark for this is a valuable contribution.
- Focuses on "ML Engineering," Not Just Modeling: The strongest part of the paper is its inclusion of "TimeSeriesGym Originals" (Table 4). Challenges like "Convert ResNet TensorFlow implementation to PyTorch" or "Improve PTB-XL ECG Classification Code" are excellent, real-world tasks that go beyond the sta
**Critically Flawed Evaluation of Kaggle Challenges:** The paper's primary evaluation metrics, "Valid Submission (%)" and "Reasonable Submission (%)," are insufficient.
- The benchmark fails to report the actual leaderboard scores or ranks for the 13 included Kaggle competitions. This is a significant omission, as these are the most standardized, competitive tasks in the dataset.
- The bar for a "Reasonable Submission" is set at scoring "above median on the competition's public leaderboard" (
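To make the two metrics named in this review concrete, here is a minimal Python sketch of how they could be computed. It is not the paper's implementation: the function name `submission_rates`, the use of `None` to mark a run without a valid submission, the `higher_is_better` flag, and the choice to divide by all runs rather than only valid ones are all assumptions made for illustration. The scores are made-up toy numbers.

```python
from statistics import median


def submission_rates(runs, leaderboard_scores, higher_is_better=True):
    """Return (valid %, reasonable %) over all agent runs.

    runs: one score per agent run, or None if the run produced no valid
          submission (hypothetical convention, not from the paper).
    leaderboard_scores: public leaderboard scores for the competition.
    """
    if not runs:
        raise ValueError("need at least one run")

    # "Reasonable" is taken to mean beating the public leaderboard median.
    threshold = median(leaderboard_scores)
    valid = [s for s in runs if s is not None]

    if higher_is_better:
        reasonable = [s for s in valid if s > threshold]
    else:
        # For error-style metrics (e.g. RMSE), lower scores are better.
        reasonable = [s for s in valid if s < threshold]

    n = len(runs)  # assumption: both rates are fractions of all runs
    return 100.0 * len(valid) / n, 100.0 * len(reasonable) / n


# Toy data: five leaderboard entries and four agent runs, one of them invalid.
leaderboard = [0.60, 0.70, 0.80, 0.90, 0.95]
agent_runs = [0.85, None, 0.75, 0.92]
valid_pct, reasonable_pct = submission_rates(agent_runs, leaderboard)
print(valid_pct, reasonable_pct)
```

In this toy case the leaderboard median is 0.80, so a run scoring 0.81 and a run scoring 0.95 both count as "reasonable" in exactly the same way. That loss of information is what the reviewer's request for actual leaderboard scores and ranks is pointing at.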
1. The benchmark collects and designs tasks based on real-world data science scenarios, including Kaggle competition problems and practical research tasks such as code migration and model evaluation. These challenges span a wide range of skills (including data processing, model construction, and code understanding and adaptation), reflecting the multifaceted challenges faced by real-world machine learning engineers.
2. TimeSeriesGym evaluates multiple forms of agent outputs, not only focusing on p
1. The primary contribution of the paper lies in the construction of the benchmark. Many of its ideas (such as using LLMs for code review, incorporating multi-source tasks, and adopting multi-metric evaluations) are extensions and integrations of existing work rather than entirely novel innovations.
2. Although the paper lists several existing benchmarks, the distinctions and connections between TimeSeriesGym and those benchmarks are not sufficiently elaborated. For example, beyond the domain dif
- Clear motivation that time series engineering requires more than single-step prediction and that agents should be evaluated on multi-stage workflows.
- The authors collected a large number of datasets, with broad task coverage and inclusion of multiple domains, which improves the ecological validity of the benchmark.
- Multi-artifact evaluation that considers predictions, code quality checks, and trained models, which aligns the benchmark with real practice.
- Agent-agnostic de
1. The paper mixes the benchmark, the task generation mechanism, the multi-artifact scoring, the trajectory data loop, and the time series focus without a clear primary-to-secondary hierarchy and without a concise contribution figure. It does not decompose difficulty across different time series task families, and it does not design difficulty along agent reasoning dimensions.
2. Due to compute constraints, most experiments are conducted on the Lite set of six tasks. Although the authors state th
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Machine Learning and Data Classification · Time Series Analysis and Forecasting
Methods: Focus · Sparse Evolutionary Training
