EWE: An Agentic Framework for Extreme Weather Analysis

Zhe Jiang; Jiong Wang; Xiaoyu Yue; Zijie Guo; Wenlong Zhang; Fenghua Ling; Wanli Ouyang; Lei Bai

arXiv:2511.21444 · cs.AI · November 27, 2025

Open Access · 3 Reviews

TL;DR

EWE is an innovative AI framework that automates the diagnostic analysis of extreme weather events, combining expert-like reasoning with multimodal data interpretation to advance scientific understanding and democratize access to meteorological expertise.

Contribution

The paper introduces EWE, the first intelligent agent for automated extreme weather diagnostics, and provides a new benchmark dataset and evaluation metric for this emerging research area.

Findings

01

EWE successfully produces interpretable visualizations from raw data.

02

EWE demonstrates effective diagnostic reasoning on high-impact weather events.

03

The benchmark dataset facilitates standardized evaluation in this field.

Abstract

Extreme weather events pose escalating risks to global society, underscoring the urgent need to unravel their underlying physical mechanisms. Yet the prevailing expert-driven, labor-intensive diagnostic paradigm has created a critical analytical bottleneck, stalling scientific progress. While AI for Earth Science has achieved notable advances in prediction, the equally essential challenge of automated diagnostic reasoning remains largely unexplored. We present the Extreme Weather Expert (EWE), the first intelligent agent framework dedicated to this task. EWE emulates expert workflows through knowledge-guided planning, closed-loop reasoning, and a domain-tailored meteorological toolkit. It autonomously produces and interprets multimodal visualizations from raw meteorological data, enabling comprehensive diagnostic analyses. To catalyze progress, we introduce the first benchmark for this…
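The closed-loop reasoning mentioned in the abstract is characterized in the reviews as a cycle of "Thought, Action, Observation, and Interpretation". The following is a minimal sketch of such a loop, not the authors' implementation: the names `Step`, `run_loop`, `plan`, `interpret`, the `plot_field` tool, and the variable names `z500` and `t2m` are all hypothetical, and the planner and interpreter are scripted stand-ins for what would presumably be model calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    thought: str         # why the agent takes this step
    action: str          # which tool it called, and with what argument
    observation: str     # what the tool returned (e.g. a figure handle)
    interpretation: str  # what the agent concluded from the observation


# Hypothetical types: a planner returns (thought, tool_name, tool_arg) or None when done.
Planner = Callable[[str, List[Step]], Optional[Tuple[str, str, str]]]
Tools = Dict[str, Callable[[str], str]]


def run_loop(question: str, plan: Planner, tools: Tools,
             interpret: Callable[[str], str], max_steps: int = 5) -> List[Step]:
    """Run a bounded thought/action/observation/interpretation loop."""
    history: List[Step] = []
    for _ in range(max_steps):
        decision = plan(question, history)
        if decision is None:  # planner signals that the analysis is complete
            break
        thought, tool_name, tool_arg = decision
        tool = tools.get(tool_name)
        observation = tool(tool_arg) if tool else f"unknown tool: {tool_name}"
        history.append(Step(thought, f"{tool_name}({tool_arg})",
                            observation, interpret(observation)))
    return history


# Scripted stand-ins (assumptions, for illustration only).
def toy_plan(question: str, history: List[Step]):
    script = [
        ("Check the large-scale circulation first.", "plot_field", "z500"),
        ("Then look at near-surface temperature.", "plot_field", "t2m"),
    ]
    return script[len(history)] if len(history) < len(script) else None


def toy_interpret(observation: str) -> str:
    return f"Noted: {observation}"


toy_tools: Tools = {"plot_field": lambda var: f"figure of {var}"}

steps = run_loop("Why was this heatwave so intense?", toy_plan, toy_tools, toy_interpret)
for s in steps:
    print(s.action, "->", s.interpretation)
```

Passing the full history back to the planner at every step is what closes the loop: each new action can depend on how earlier observations were interpreted.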

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 5

Strengths

Automated diagnostic reasoning of extreme weather events could be an important research topic.

Weaknesses

1. The details of the method are missing. For example, the authors mention embedded meteorological knowledge, but they never explain what kind of knowledge or rules are embedded, or how they annotate the CoT guidelines.
2. The dataset they created is one of the main contributions, as the authors write. However, they do not provide its details, such as how they picked the 103 extreme events, how many variables they selected, etc.
3. The evaluation in this paper is not reliab

Reviewer 02 · Rating 4 · Confidence 3

Strengths

**Problem Significance:** Automating the diagnostic analysis process, as proposed, could dramatically accelerate research, improve forecasting models, and inform climate adaptation strategies, particularly in developing nations that often lack dedicated meteorological expertise. The motivation for this work is compelling and well-articulated.

**Novel Benchmark Dataset:** The curation of a benchmark dataset of 103 high-impact extreme weather events is a significant and tangible contribution. Thi

Weaknesses

**Relation to the ReAct Framework:** The core reasoning loop of "Thought, Action, Observation, and Interpretation" is functionally very similar to the established ReAct paradigm. Explicitly framing the work as an application and domain-specific adaptation of ReAct would help clarify the paper's contribution, shifting the focus from the framework's structure to its successful implementation in a complex scientific domain.

**Contextualizing Tool Use:** The "Meteorological Toolkit" is a necessa

Reviewer 03 · Rating 4 · Confidence 4

Strengths

S1: The problem addressed by this paper is of significant scientific and societal importance. The shift in focus within AI for Earth Science from pure prediction (the domain of models like Pangu-Weather and GraphCast) to automated scientific understanding and diagnosis is a crucial and welcome research direction. This work pioneers a new task definition: automated post-hoc diagnostic reasoning. This is a valuable contribution in itself, as it frames a complex scientific workflow as a tractable p

Weaknesses

W1: The paper's entire quantitative evaluation (Table 1) relies on a single multimodal large model, GPT-4.1, as the judge. This "LLM-as-a-Judge" approach is widely documented to suffer from multiple potential biases, such as verbosity bias (favoring longer answers), position bias, and overlooking fallacies in reasoning. The paper's core claims rest on an evaluation protocol that has not been sufficiently validated, constituting a fatal flaw in its scientific rigor. Specifically, the paper suffer
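One of the biases named in W1, position bias, can be probed by asking the judge the same comparison twice with the order swapped. The sketch below assumes a pairwise comparison setup, which the source does not confirm the paper uses; the `judge` signature and both toy judges are hypothetical stand-ins, not calls to any real model API.

```python
from typing import Callable

# Hypothetical judge signature: judge(first, second) returns "first" or "second".
Judge = Callable[[str, str], str]


def order_consistent_verdict(judge: Judge, report_a: str, report_b: str) -> str:
    """Return "A" or "B" only if the judge picks the same report in both orders."""
    forward = judge(report_a, report_b)   # A is shown first
    backward = judge(report_b, report_a)  # B is shown first
    winner_forward = "A" if forward == "first" else "B"
    winner_backward = "B" if backward == "first" else "A"
    return winner_forward if winner_forward == winner_backward else "inconsistent"


# Toy judge with pure position bias: always prefers whatever it sees first.
def always_first(first: str, second: str) -> str:
    return "first"


# Toy judge with pure verbosity bias: always prefers the longer text.
def longer_wins(first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"


print(order_consistent_verdict(always_first, "report A", "report B"))
print(order_consistent_verdict(longer_wins, "short", "a much longer report"))
```

The swap flags the position-biased judge as inconsistent, but the verbosity-biased judge passes it while still choosing on length alone, so an order swap by itself does not address the other biases the reviewer lists.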

Taxonomy

Topics: Data Visualization and Analytics · Meteorological Phenomena and Simulations · Geographic Information Systems Studies