TCBench: A Benchmark for Tropical Cyclone Track and Intensity Forecasting at the Global Scale

Milton Gomez; Marie McGraw; Saranya Ganesh S.; Frederick Iat-Hin Tam; Ilia Azizi; Samuel Darmon; Monika Feldmann; Stella Bourdin; Louis Poulain--Auzéau; Suzana J. Camargo; Jonathan Lin; Dan Chavas; Chia-Ying Lee; Ritwik Gupta; Andrea Jenney; Tom Beucler

arXiv:2601.23268·cs.CE·February 2, 2026



Open Access · 3 Reviews

TL;DR

TCBench is a comprehensive benchmark for evaluating global tropical cyclone track and intensity forecasts using observational data and state-of-the-art models, facilitating fair comparisons and advancing data-driven TC prediction methods.

Contribution

It introduces a standardized, model-agnostic framework for evaluating tropical cyclone forecasts, integrating diverse models and providing accessible tools for researchers and meteorologists.

Findings

01 · Neural weather models accurately forecast TC tracks.

02 · Intensity forecasts need post-processing for improved skill.

03 · Benchmark promotes reproducibility and fair comparison of models.

Abstract

TCBench is a benchmark for evaluating global, short to medium-range (1-5 days) forecasts of tropical cyclone (TC) track and intensity. To allow a fair and model-agnostic comparison, TCBench builds on the IBTrACS observational dataset and formulates TC forecasting as predicting the time evolution of an existing tropical system conditioned on its initial position and intensity. TCBench includes state-of-the-art dynamical (TIGGE) and neural weather models (AIFS, Pangu-Weather, FourCastNet v2, GenCast). If not readily available, baseline tracks are consistently derived from model outputs using the TempestExtremes library. For evaluation, TCBench provides deterministic and probabilistic storm-following metrics. On 2023 test cases, neural weather models skillfully forecast TC tracks, while skillful intensity forecasts require additional steps such as post-processing. Designed for…
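The deterministic storm-following evaluation described in the abstract can be illustrated with a minimal sketch. It assumes that track error is the great-circle distance between forecast and observed storm centers and that intensity error is the absolute difference in maximum wind, both scored per lead time. The function names, the dictionary layout, and the choice of haversine distance are illustrative assumptions, not TCBench's actual API or metric definitions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def storm_following_errors(forecast, observed):
    """Score one storm: track error (km) and absolute intensity error.

    Hypothetical layout: both arguments map lead time in hours to a
    (lat, lon, max_wind) tuple. Only lead times present in both are scored.
    """
    errors = {}
    for lead in sorted(set(forecast) & set(observed)):
        f_lat, f_lon, f_wind = forecast[lead]
        o_lat, o_lon, o_wind = observed[lead]
        track_err = great_circle_km(f_lat, f_lon, o_lat, o_lon)
        errors[lead] = (track_err, abs(f_wind - o_wind))
    return errors
```

Averaging these per-lead errors over all storms in a test season would give one curve per model, which is the kind of comparison a fair, model-agnostic benchmark relies on.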

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The paper is original to the best of my knowledge.
- The paper is significant. It proposes a meaningful step forward for the field of data-driven weather forecasting, where most existing benchmarks evaluate the overall accuracy of methods while ignoring the equally important aspect of predicting extreme events like cyclones.
- The benchmark uses a standard data format and consistent evaluation pipelines, which ensures fairness and reproducibility.
- The benchmark provides post-processing steps

Weaknesses

- One major weakness is that the benchmark only considers forecasting tracks and the intensity of an existing cyclone, not an upcoming one. However, this is still a valid setting and has practical relevance.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. TCBench establishes a standardized and relatively fair evaluation pipeline. It uses IBTrACS as the "ground truth," converting all data into a unified format based on IBTrACS identifiers for consistency. For models that do not provide readily available tracks, it employs the unified TempestExtremes library with consistent parameters to derive tracks from raw model outputs. When a model fails to forecast a storm, TCBench does not simply ignore the sample but fills it using the persistence baseli

Weaknesses

1. TCBench relies on the IBTrACS observational dataset as the ground truth. While IBTrACS is the most complete and authoritative global TC archive currently available, it has limitations: (1) Its quality varies by basin (e.g., lower reliability in the South Indian Ocean), inconsistencies exist in the initial track points determined by different agencies, and it lacks rigorous cross-validation against other data sources (e.g., regional satellite observations, ground radar data). Therefore, its abs

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- This study presents a variety of benchmark tasks and experimental protocols for tropical cyclone prediction, encompassing data preprocessing, visualization tools, and evaluation metrics. In particular, it highlights the challenge of Rapid Intensification, pointing out the limitations of existing data-driven approaches in effectively capturing this phenomenon.
- In contrast to previous data-driven methods that have mainly focused on reducing errors in track prediction, this study emphasizes th

Weaknesses

- In Line 144, the term “real-time-available data” is used, but ERA5 is a reanalysis dataset, which means it is not available in real time. Therefore, it seems that real-time prediction would not be possible through TCBench.
- This study proposes a benchmark framework that introduces various tasks and conducts experiments using baseline models. Since it aims to cover a wide range of aspects, there is still room for further experiments to demonstrate the utility of the benchmark. As shown in Fig


Taxonomy

Topics: Tropical and Extratropical Cyclones Research · Meteorological Phenomena and Simulations · Climate variability and models