DUALVISION: RGB-Infrared Multimodal Large Language Models for Robust Visual Reasoning

Abrar Majeedi; Zhiyuan Ruan; Ziyi Zhao; Hongcheng Wang; Jianglin Lu; Yin Li

arXiv:2604.18829 · cs.CV · April 22, 2026

TL;DR

DUALVISION is a lightweight fusion module that integrates infrared and RGB images into multimodal large language models, improving robustness under degraded visual conditions such as fog, blur, and low light. It is accompanied by a new dataset (DV-204K) and a new benchmark (DV-500).

Contribution

The paper proposes a lightweight IR-RGB fusion module for MLLMs, based on patch-level localized cross-attention, and introduces the DV-204K dataset and the DV-500 benchmark for evaluating multimodal reasoning under challenging conditions.

Findings

01. DUALVISION improves robustness of MLLMs under fog, blur, and low-light conditions.

02. The datasets enable comprehensive evaluation of IR-RGB multimodal reasoning.

03. Benchmark results show DUALVISION's superior performance across various visual degradations.

Abstract

Multimodal large language models (MLLMs) have achieved impressive performance on visual perception and reasoning tasks with RGB imagery, yet they remain fragile under common degradations, such as fog, blur, or low-light conditions. Infrared (IR) imaging, a well-established complement to RGB, offers inherent robustness in these conditions, but its integration into MLLMs remains underexplored. To bridge this gap, we propose DUALVISION, a lightweight fusion module that efficiently incorporates IR-RGB information into MLLMs via patch-level localized cross-attention. To support training and evaluation and to facilitate future research, we also introduce DV-204K, a dataset of ~25K publicly available aligned IR-RGB image pairs with 204K modality-specific QA annotations, and DV-500, a benchmark of 500 IR-RGB image pairs with 500 QA pairs designed for evaluating cross-modal reasoning. Leveraging…
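To make the phrase "patch-level localized cross-attention" concrete, here is a minimal NumPy sketch of the general idea: each RGB patch acts as a query and attends only to IR patches in a small spatial neighborhood around the same grid position. This is an illustration under stated assumptions, not the paper's implementation. The function name `local_cross_attention`, the `radius` parameter, the absence of learned query/key/value projections, and the residual addition are all hypothetical choices made for the example.

```python
import numpy as np

def local_cross_attention(rgb, ir, radius=1):
    """Fuse IR patch features into RGB patch features (illustrative only).

    rgb, ir: arrays of shape (H, W, D), one D-dim feature vector per patch.
    Each RGB patch (query) attends only to the IR patches (keys/values)
    lying within `radius` grid cells of its own position.
    Assumption: no learned projections; a real module would add them.
    """
    H, W, D = rgb.shape
    fused = np.empty_like(rgb, dtype=float)
    for i in range(H):
        for j in range(W):
            # Clip the local window at the borders of the patch grid.
            i0, i1 = max(0, i - radius), min(H, i + radius + 1)
            j0, j1 = max(0, j - radius), min(W, j + radius + 1)
            window = ir[i0:i1, j0:j1].reshape(-1, D)      # (K, D) IR patches
            # Scaled dot-product scores between the RGB query and IR keys.
            scores = window @ rgb[i, j] / np.sqrt(D)      # (K,)
            weights = np.exp(scores - scores.max())       # stable softmax
            weights /= weights.sum()
            # Assumed residual fusion: RGB feature plus attended IR context.
            fused[i, j] = rgb[i, j] + weights @ window
    return fused

# Example usage on random features for a 4x4 patch grid.
rng = np.random.default_rng(0)
rgb_feats = rng.normal(size=(4, 4, 8))
ir_feats = rng.normal(size=(4, 4, 8))
fused_feats = local_cross_attention(rgb_feats, ir_feats, radius=1)
```

Restricting attention to a local window keeps the cost linear in the number of patches (each query sees at most `(2 * radius + 1) ** 2` keys), which is one plausible reason a localized design can be described as lightweight. It also relies on the IR and RGB images being spatially aligned, as the dataset description states.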


Code & Models

Repositories

https://abrarmajeedi.github.io/dualvision
