K-MetBench: A Multi-Dimensional Benchmark for Fine-Grained Evaluation of Expert Reasoning, Locality, and Multimodality in Meteorology

Soyeon Kim; Cheongwoong Kang; Myeongjin Lee; Eun-Chul Chang; Jaedeok Lee; Jaesik Choi

arXiv:2604.24645·cs.CL·April 28, 2026

TL;DR

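K-MetBench is a diagnostic benchmark, grounded in national qualification exams, for evaluating (multimodal) large language models on Korean meteorology. It covers expert visual reasoning over charts, logical validity of rationales, Korean-specific geo-cultural comprehension, and fine-grained domain analysis, and it reveals significant modality and reasoning gaps.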

Contribution

This paper introduces K-MetBench, a multidimensional, expert-level evaluation framework for (multimodal) large language model assistants aimed at Korean weather forecasters, and uses it to expose cultural and modality-related weaknesses in model reasoning.

Findings

01. Models show a significant modality gap when interpreting specialized diagrams.

02. Models hallucinate their reasoning even when their predictions are correct.

03. Korean models outperform significantly larger global models in local contexts.

Abstract

The development of practical (multimodal) large language model assistants for Korean weather forecasters is hindered by the absence of a multidimensional, expert-level evaluation framework grounded in authoritative sources. To address this, we introduce K-MetBench, a diagnostic benchmark grounded in national qualification exams. It exposes critical gaps across four dimensions: expert visual reasoning of charts, logical validity via expert-verified rationales, Korean-specific geo-cultural comprehension, and fine-grained domain analysis. Our evaluation of 55 models reveals a profound modality gap in interpreting specialized diagrams and a reasoning gap where models hallucinate logic despite correct predictions. Crucially, Korean models outperform significantly larger global models in local contexts, demonstrating that parameter scaling alone cannot resolve cultural dependencies.…
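
To make the two headline gaps concrete, here is a minimal sketch of how they could be computed from per-question results. Everything in it is an assumption made for illustration: the record fields (`modality`, `answer_correct`, `rationale_valid`), the toy data, and the two metric definitions are hypothetical and are not taken from the paper or from the released dataset's schema.

```python
from collections import defaultdict

# Hypothetical per-question records. The field names and values are
# illustrative assumptions, not the schema of the released dataset.
records = [
    {"modality": "text", "answer_correct": True, "rationale_valid": True},
    {"modality": "text", "answer_correct": True, "rationale_valid": False},
    {"modality": "text", "answer_correct": False, "rationale_valid": False},
    {"modality": "image", "answer_correct": True, "rationale_valid": False},
    {"modality": "image", "answer_correct": False, "rationale_valid": False},
    {"modality": "image", "answer_correct": False, "rationale_valid": False},
]


def accuracy_by(rows, key):
    """Return answer accuracy for each distinct value of rows[i][key]."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        totals[row[key]] += 1
        hits[row[key]] += int(row["answer_correct"])
    return {group: hits[group] / totals[group] for group in totals}


def unsupported_correct_rate(rows):
    """Share of correct answers whose rationale was judged invalid."""
    correct = [row for row in rows if row["answer_correct"]]
    if not correct:
        return 0.0
    return sum(not row["rationale_valid"] for row in correct) / len(correct)


acc = accuracy_by(records, "modality")
# Assumed definition: text-only accuracy minus image-question accuracy.
modality_gap = acc["text"] - acc["image"]
# Assumed definition: fraction of correct answers with an invalid rationale.
reasoning_gap = unsupported_correct_rate(records)

print(f"accuracy by modality: {acc}")
print(f"modality gap: {modality_gap:.3f}")
print(f"correct answers with invalid rationale: {reasoning_gap:.3f}")
```

Reporting the second number alongside plain accuracy is one way to surface cases where a model picks the right option for the wrong reason. The paper may define both gaps differently.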

Code & Models

Repositories

https://huggingface.co/datasets/soyeonbot/K-MetBench

Datasets

soyeonbot/K-MetBench (dataset, 34 downloads)
