Explaining the Unseen: Multimodal Vision-Language Reasoning for Situational Awareness in Underground Mining Disasters

Mizanur Rahman Jewel; Mohamed Elmahallawy; Sanjay Madria; Samuel Frimpong

arXiv:2512.09092·cs.CV·December 11, 2025

TL;DR

This paper introduces MDSE, a multimodal vision-language framework designed to generate detailed textual explanations of underground disaster scenes, enhancing situational awareness in obscured and hazardous environments.

Contribution

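The paper presents MDSE, a multimodal framework that combines context-aware cross-attention, segmentation-aware dual-pathway visual encoding, and a resource-efficient transformer language model, along with UMD, a new dataset for underground disaster scene captioning.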

Findings

01. MDSE outperforms existing captioning models on the UMD dataset.

02. The framework produces more accurate, contextually relevant descriptions.

03. Experimental results demonstrate improved situational awareness capabilities.

Abstract

Underground mining disasters produce pervasive darkness, dust, and collapses that obscure vision and make situational awareness difficult for humans and conventional systems. To address this, we propose MDSE (Multimodal Disaster Situation Explainer), a novel vision-language framework that automatically generates detailed textual explanations of post-disaster underground scenes. MDSE introduces three innovations: (i) context-aware cross-attention for robust alignment of visual and textual features even under severe degradation; (ii) segmentation-aware dual-pathway visual encoding that fuses global and region-specific embeddings; and (iii) a resource-efficient transformer-based language model for expressive caption generation with minimal compute cost. To support this task, we present the Underground Mine Disaster (UMD) dataset, the first image-caption corpus of real underground disaster…
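To make the first two components concrete, here is a minimal sketch of generic scaled dot-product cross-attention over a dual-pathway token set (one global embedding plus several region embeddings). It is not the MDSE implementation: the function names, dimensions, and random inputs are hypothetical, the learned query/key/value projections are omitted, and the paper's context-aware variant is not reproduced.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def dual_pathway_tokens(global_embedding, region_embeddings):
    """Assumed fusion step: stack one global vector with R region vectors."""
    return [global_embedding] + list(region_embeddings)


def cross_attention(queries, keys, values):
    """Each query (text token) attends over all keys/values (visual tokens).

    Plain scaled dot-product attention, single head, no learned projections.
    """
    scale = math.sqrt(len(queries[0]))
    outputs, all_weights = [], []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical sizes: 8-dim embeddings, 3 segmented regions, 5 text tokens.
rng = random.Random(0)
dim = 8


def rand_vec():
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


global_embedding = rand_vec()
region_embeddings = [rand_vec() for _ in range(3)]
text_tokens = [rand_vec() for _ in range(5)]

visual_tokens = dual_pathway_tokens(global_embedding, region_embeddings)
attended, weights = cross_attention(text_tokens, visual_tokens, visual_tokens)
print(len(attended), len(attended[0]), len(weights[0]))
```

Each row of `weights` is a distribution over the global token and the region tokens, so a text token can draw on region-specific evidence when the global view is degraded. That is the intuition the abstract points to; how MDSE actually weights the two pathways is not specified in this excerpt.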

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis