DisastQA: A Comprehensive Benchmark for Evaluating Question Answering in Disaster Management

Zhitong Chen; Kai Yin; Xiangjue Dong; Chengkai Liu; Xiangpeng Li; Yiming Xiao; Bo Li; Junwei Ma; Ali Mostafavi; James Caverlee

arXiv:2601.03670 · cs.CL · January 8, 2026

TL;DR

DisastQA is a large-scale benchmark designed to evaluate question answering models' ability to reason over uncertain, conflicting, and noisy information in disaster management scenarios, highlighting current models' reliability gaps.

Contribution

The paper introduces DisastQA, a comprehensive, human-LLM curated benchmark with diverse questions and evaluation protocols tailored for disaster-related QA under realistic conditions.

Findings

01. Models perform well in clean settings but poorly with noisy evidence.

02. Recent models approach proprietary systems in ideal conditions.

03. Performance drops significantly under realistic noisy evidence scenarios.

Abstract

Accurate question answering (QA) in disaster management requires reasoning over uncertain and conflicting information, a setting poorly captured by existing benchmarks built on clean evidence. We introduce DisastQA, a large-scale benchmark of 3,000 rigorously verified questions (2,000 multiple-choice and 1,000 open-ended) spanning eight disaster types. The benchmark is constructed via a human-LLM collaboration pipeline with stratified sampling to ensure balanced coverage. Models are evaluated under varying evidence conditions, from closed-book to noisy evidence integration, enabling separation of internal knowledge from reasoning under imperfect information. For open-ended QA, we propose a human-verified keypoint-based evaluation protocol emphasizing factual completeness over verbosity. Experiments with 20 models reveal substantial divergences from general-purpose leaderboards such as…
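To make the idea of keypoint-based scoring concrete, here is a minimal sketch of a completeness metric that rewards covering reference keypoints rather than answer length. It is not the paper's protocol: the function name `keypoint_recall`, the token-overlap matching rule, and the example keypoints are all assumptions made for illustration, and the abstract does not describe how keypoints are actually matched.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keypoint_recall(answer: str, keypoints: list[str]) -> float:
    """Fraction of reference keypoints fully covered by the answer.

    Hypothetical matching rule (an assumption, not the paper's method):
    a keypoint counts as covered when every one of its tokens appears
    somewhere in the answer.
    """
    if not keypoints:
        return 0.0
    answer_tokens = _tokens(answer)
    covered = sum(1 for kp in keypoints if _tokens(kp) <= answer_tokens)
    return covered / len(keypoints)


# Illustrative data only; these keypoints are invented for the example.
keypoints = ["evacuate low-lying areas", "boil water before drinking"]
concise = "Evacuate low-lying areas and boil water before drinking."
verbose = (
    "Disasters are complex events with many factors to consider. " * 5
    + "Evacuate low-lying areas."
)

print(keypoint_recall(concise, keypoints))
print(keypoint_recall(verbose, keypoints))
```

Under this rule the short answer covers both keypoints while the longer one covers only one, so padding an answer does not raise its score. A real protocol would likely need semantic rather than lexical matching.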

Taxonomy

Topics: Expert finding and Q&A systems · Topic Modeling · Multimodal Machine Learning Applications