TCBench: A Benchmark for Tropical Cyclone Track and Intensity Forecasting at the Global Scale
Milton Gomez, Marie McGraw, Saranya Ganesh S., Frederick Iat-Hin Tam, Ilia Azizi, Samuel Darmon, Monika Feldmann, Stella Bourdin, Louis Poulain--Auzéau, Suzana J. Camargo, Jonathan Lin, Dan Chavas, Chia-Ying Lee, Ritwik Gupta, Andrea Jenney, Tom Beucler

TL;DR
TCBench is a comprehensive benchmark for evaluating global tropical cyclone track and intensity forecasts using observational data and state-of-the-art models, facilitating fair comparisons and advancing data-driven TC prediction methods.
Contribution
It introduces a standardized, model-agnostic framework for evaluating tropical cyclone forecasts, integrating diverse models and providing accessible tools for researchers and meteorologists.
Findings
Neural weather models accurately forecast TC tracks.
Intensity forecasts need post-processing for improved skill.
Benchmark promotes reproducibility and fair comparison of models.
Abstract
TCBench is a benchmark for evaluating global, short to medium-range (1-5 days) forecasts of tropical cyclone (TC) track and intensity. To allow a fair and model-agnostic comparison, TCBench builds on the IBTrACS observational dataset and formulates TC forecasting as predicting the time evolution of an existing tropical system conditioned on its initial position and intensity. TCBench includes state-of-the-art dynamical (TIGGE) and neural weather models (AIFS, Pangu-Weather, FourCastNet v2, GenCast). If not readily available, baseline tracks are consistently derived from model outputs using the TempestExtremes library. For evaluation, TCBench provides deterministic and probabilistic storm-following metrics. On 2023 test cases, neural weather models skillfully forecast TC tracks, while skillful intensity forecasts require additional steps such as post-processing. Designed for…
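As a rough illustration of what a deterministic storm-following metric can look like, the sketch below scores one forecast against observed positions at matching lead times, using great-circle distance for track error and absolute difference for intensity error. This is not TCBench's API: the function names, the dictionary layout (lead time in hours mapped to latitude, longitude, and maximum wind), and the choice of the haversine formula are all assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def storm_following_errors(forecast, observed):
    """Return {lead_hours: (track_error_km, abs_intensity_error)}.

    Both inputs are hypothetical dicts mapping lead time in hours to
    (lat, lon, vmax), with vmax in the same unit in both dicts.
    Only lead times present in both dicts are scored.
    """
    errors = {}
    for lead in sorted(forecast.keys() & observed.keys()):
        f_lat, f_lon, f_vmax = forecast[lead]
        o_lat, o_lon, o_vmax = observed[lead]
        track_error = great_circle_km(f_lat, f_lon, o_lat, o_lon)
        errors[lead] = (track_error, abs(f_vmax - o_vmax))
    return errors


# Made-up values, for illustration only.
forecast = {24: (20.0, 130.0, 50.0), 48: (22.0, 128.0, 60.0), 72: (24.0, 126.0, 65.0)}
observed = {24: (20.0, 131.0, 55.0), 48: (22.0, 128.0, 70.0)}
errors = storm_following_errors(forecast, observed)
```

Scoring only the lead times shared by forecast and observation is one simple convention. A benchmark has to decide separately how to handle forecasts that lose the storm, which this sketch leaves out.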
Peer Reviews
Decision·Submitted to ICLR 2026
- The paper is original to the best of my knowledge.
- The paper is significant. It proposes a meaningful step forward for the field of data-driven weather forecasting, where most existing benchmarks evaluate the overall accuracy of methods while ignoring the equally important aspect of predicting extreme events like cyclones.
- The benchmark uses a standard data format and consistent evaluation pipelines, which ensures fairness and reproducibility.
- The benchmark provides post-processing steps
- One major weakness is that the benchmark only considers forecasting tracks and the intensity of an existing cyclone, not an upcoming one. However, this is still a valid setting and has practical relevance.
1. TCBench establishes a standardized and relatively fair evaluation pipeline. It uses IBTrACS as the "ground truth," converting all data into a unified format based on IBTrACS identifiers for consistency. For models that do not provide readily available tracks, it employs the unified TempestExtremes library with consistent parameters to derive tracks from raw model outputs. When a model fails to forecast a storm, TCBench does not simply ignore the sample but fills it using the persistence baseli
1. TCBench relies on the IBTrACS observational dataset as the ground truth. While IBTrACS is the most complete and authoritative global TC archive currently available, it has limitations: (1) Its quality varies by basin (e.g., lower reliability in the South Indian Ocean), inconsistencies exist in the initial track points determined by different agencies, and it lacks rigorous cross-validation against other data sources (e.g., regional satellite observations, ground radar data). Therefore, its abs
- This study presents a variety of benchmark tasks and experimental protocols for tropical cyclone prediction, encompassing data preprocessing, visualization tools, and evaluation metrics. In particular, it highlights the challenge of Rapid Intensification, pointing out the limitations of existing data-driven approaches in effectively capturing this phenomenon.
- In contrast to previous data-driven methods that have mainly focused on reducing errors in track prediction, this study emphasizes th
- In Line 144, the term “real-time-available data” is used, but ERA5 is a reanalysis dataset, which means it is not available in real time. Therefore, it seems that real-time prediction would not be possible through TCBench.
- This study proposes a benchmark framework that introduces various tasks and conducts experiments using baseline models. Since it aims to cover a wide range of aspects, there is still room for further experiments to demonstrate the utility of the benchmark. As shown in Fig
Taxonomy
Topics: Tropical and Extratropical Cyclones Research · Meteorological Phenomena and Simulations · Climate variability and models
