TL;DR
DUALVISION is a lightweight fusion module that integrates infrared (IR) and RGB images into multimodal large language models (MLLMs), improving robustness under degraded visual conditions. It is accompanied by the DV-204K dataset and the DV-500 benchmark.
Contribution
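The paper proposes a lightweight IR-RGB fusion module for MLLMs based on patch-level localized cross-attention. It also introduces DV-204K, a dataset of ~25K aligned IR-RGB image pairs with 204K modality-specific QA annotations, and DV-500, a benchmark of 500 IR-RGB image pairs with 500 QA pairs for evaluating cross-modal reasoning.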
Findings
DUALVISION improves robustness of MLLMs under fog, blur, and low-light conditions.
DV-204K and DV-500 support training and evaluation of IR-RGB multimodal reasoning.
Benchmark results show DUALVISION's superior performance across various visual degradations.
Abstract
Multimodal large language models (MLLMs) have achieved impressive performance on visual perception and reasoning tasks with RGB imagery, yet they remain fragile under common degradations, such as fog, blur, or low-light conditions. Infrared (IR) imaging, a well-established complement to RGB, offers inherent robustness in these conditions, but its integration into MLLMs remains underexplored. To bridge this gap, we propose DUALVISION, a lightweight fusion module that efficiently incorporates IR-RGB information into MLLMs via patch-level localized cross-attention. To support training and evaluation and to facilitate future research, we also introduce DV-204K, a dataset of ~25K publicly available aligned IR-RGB image pairs with 204K modality-specific QA annotations, and DV-500, a benchmark of 500 IR-RGB image pairs with 500 QA pairs designed for evaluating cross-modal reasoning. Leveraging…
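The abstract names patch-level localized cross-attention but does not spell out the mechanism, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that each RGB patch feature acts as a query over the IR patch features in a small spatial window around the same grid position, and that the attended IR context is added back as a residual. The function name `local_cross_attention`, the `radius` parameter, the residual connection, and the omission of learned projections are all illustrative assumptions.

```python
import math

def local_cross_attention(rgb, ir, radius=1):
    """Fuse IR into RGB patch features with windowed cross-attention.

    rgb, ir: H x W grids (nested lists) of d-dimensional feature vectors.
    Each RGB patch queries only the IR patches within `radius` grid steps.
    Assumption: learned query/key/value projections are omitted (identity).
    """
    h, w, d = len(rgb), len(rgb[0]), len(rgb[0][0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for i in range(h):
        row = []
        for j in range(w):
            q = rgb[i][j]
            # Collect IR neighbours inside the local window, clipped at borders.
            neigh = [ir[a][b]
                     for a in range(max(0, i - radius), min(h, i + radius + 1))
                     for b in range(max(0, j - radius), min(w, j + radius + 1))]
            # Scaled dot-product scores, then a numerically stable softmax.
            scores = [scale * sum(x * y for x, y in zip(q, k)) for k in neigh]
            m = max(scores)
            exps = [math.exp(s - m) for s in scores]
            z = sum(exps)
            weights = [e / z for e in exps]
            # Weighted sum of IR values, added to the RGB feature as a residual
            # (the residual is an assumption of this sketch).
            ctx = [sum(wt * v[c] for wt, v in zip(weights, neigh))
                   for c in range(d)]
            row.append([x + y for x, y in zip(q, ctx)])
        fused.append(row)
    return fused

# Toy 2x2 grid of 2-d patch features (made-up values, not from the paper).
rgb = [[[1.0, 0.0], [0.0, 1.0]],
       [[1.0, 1.0], [0.0, 0.0]]]
ir = [[[0.5, 0.5], [0.5, 0.5]],
      [[0.5, 0.5], [0.5, 0.5]]]
fused = local_cross_attention(rgb, ir, radius=1)
```

Restricting each query to a local window keeps the cost proportional to the number of patches times the window size, rather than quadratic in the number of patches, which is one plausible reading of "lightweight" here.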
