SmokeBench: Evaluating Multimodal Large Language Models for Wildfire Smoke Detection
Tianye Qi, Weihao Li, Nick Barnes

TL;DR
This paper introduces SmokeBench, a benchmark to evaluate multimodal large language models' ability to detect and localize wildfire smoke, revealing current models' limitations in early-stage smoke detection.
Contribution
The paper presents SmokeBench, a new benchmark with four tasks (smoke classification, tile-based localization, grid-based localization, and smoke detection) for assessing how well MLLMs recognize and localize wildfire smoke in images.
Findings
Some models can classify the presence of smoke when it covers a large area
All models struggle with accurate localization, especially for early-stage smoke
Smoke volume is strongly correlated with model performance
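As an illustration of the last finding, the relationship between smoke extent and a per-image score could be checked with a plain Pearson correlation. This is a minimal sketch, not the authors' analysis: the function name `pearson`, the use of smoke area fraction as a stand-in for "smoke volume", and the sample values are all assumptions made for the example.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-image values, invented purely for illustration:
# fraction of the image covered by smoke, and a localization score in [0, 1].
smoke_fraction = [0.01, 0.05, 0.10, 0.30, 0.60]
score = [0.05, 0.10, 0.30, 0.55, 0.80]
r = pearson(smoke_fraction, score)
```

A value of `r` close to 1 would indicate that larger smoke regions go together with higher scores; it says nothing about causation or about which image property drives the effect.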
Abstract
Wildfire smoke is transparent, amorphous, and often visually confounded with clouds, making early-stage detection particularly challenging. In this work, we introduce a benchmark, called SmokeBench, to evaluate the ability of multimodal large language models (MLLMs) to recognize and localize wildfire smoke in images. The benchmark consists of four tasks: (1) smoke classification, (2) tile-based smoke localization, (3) grid-based smoke localization, and (4) smoke detection. We evaluate several MLLMs, including Idefics2, Qwen2.5-VL, InternVL3, Unified-IO 2, Grounding DINO, GPT-4o, and Gemini-2.5 Pro. Our results show that while some models can classify the presence of smoke when it covers a large area, all models struggle with accurate localization, especially in the early stages. Further analysis reveals that smoke volume is strongly correlated with model performance, whereas contrast…
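To make the grid-based localization task concrete, the sketch below shows one way such a task could be scored: a ground-truth smoke mask is reduced to the set of grid cells it touches, and a model's predicted cells are compared against that set with intersection over union. The abstract does not specify the grid size or the metric, so the helper names `mask_to_cells` and `cell_iou`, the 2x2 grid, and the choice of IoU are assumptions for illustration only.

```python
from typing import List, Set, Tuple

Cell = Tuple[int, int]  # (grid_row, grid_col)


def mask_to_cells(mask: List[List[int]], rows: int, cols: int) -> Set[Cell]:
    """Return the grid cells that contain at least one smoke pixel.

    `mask` is a binary image (list of rows); the image is split into
    a `rows` x `cols` grid using integer arithmetic.
    """
    height, width = len(mask), len(mask[0])
    cells: Set[Cell] = set()
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                cells.add((y * rows // height, x * cols // width))
    return cells


def cell_iou(pred: Set[Cell], truth: Set[Cell]) -> float:
    """Intersection over union of predicted and ground-truth cell sets."""
    union = pred | truth
    if not union:
        return 1.0  # no smoke anywhere and none predicted: treat as a perfect match
    return len(pred & truth) / len(union)


# Hypothetical 4x4 mask with a small plume in the top-right corner.
mask = [
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
truth_cells = mask_to_cells(mask, rows=2, cols=2)   # {(0, 1)}
predicted_cells = {(0, 1), (1, 1)}                  # model also flags the cell below
iou = cell_iou(predicted_cells, truth_cells)        # 0.5
```

A small early-stage plume occupies few cells, so a single spurious or missed cell changes the score sharply, which is one plausible reason localization is harder in the early stages.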
Taxonomy
Topics: Fire Detection and Safety Systems · Image Enhancement Techniques · Fire effects on ecosystems
