TL;DR
Thermo-VL is a vision-language model that fuses thermal infrared data with RGB imagery to improve low-light scene understanding and cross-spectrum reasoning.
Contribution
It introduces a wavelength-aware fusion module, a new RGB-thermal dataset, and a benchmark for low-light and thermal reasoning tasks.
Findings
Reports significant improvements on thermal-only and RGB+thermal reasoning tasks.
Fuses thermal and RGB data without disrupting the pretrained RGB-language model.
Provides a new dataset and benchmark for RGB-thermal visual question answering.
Abstract
Vision-language models (VLMs) often fail under low illumination because their visual grounding is learned predominantly from RGB imagery, whereas thermal infrared preserves complementary scene structure when visible cues degrade. We present Thermo-VL, a wavelength-aware VLM that augments a frozen Molmo-7B backbone with a trainable thermal encoder and a text-guided dual-attention fusion module. Given aligned RGB tokens, thermal tokens, and prompt embeddings, the fusion module conditions thermal features on both language and RGB context, then injects a gated residual into the frozen RGB stream so thermal evidence can be incorporated without disrupting Molmo's pretrained RGB-language interface. We train the model with the standard language-modeling objective together with auxiliary alignment and regularization losses that improve cross-modal grounding and reduce over-reliance on RGB. We…
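The fusion step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the single-head attention, the additive combination of the two attention outputs, the scalar tanh gate initialized to zero, and every name here (`cross_attention`, `gated_thermal_fusion`, `init_params`, `w_out`) are assumptions made for illustration. It also assumes RGB and thermal tokens are aligned one-to-one. The sketch shows thermal tokens attending to the prompt and to the RGB tokens, and the result being added to the RGB stream through a gate, so that a closed gate leaves the RGB tokens unchanged.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product attention from queries to context."""
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def init_params(d, rng):
    """Hypothetical parameter layout; the gate starts closed (assumption)."""
    def mat():
        return rng.normal(scale=d ** -0.5, size=(d, d))
    return {
        "text": (mat(), mat(), mat()),  # thermal -> prompt attention
        "rgb": (mat(), mat(), mat()),   # thermal -> RGB attention
        "w_out": mat(),                 # projection back into the RGB token space
        "gate": 0.0,                    # scalar gate logit, zero-initialized
    }


def gated_thermal_fusion(rgb, thermal, text, params):
    """Condition thermal tokens on text and RGB, then add a gated residual.

    rgb, thermal: (N, d) aligned token arrays; text: (T, d) prompt embeddings.
    """
    from_text = cross_attention(thermal, text, *params["text"])
    from_rgb = cross_attention(thermal, rgb, *params["rgb"])
    fused = thermal + from_text + from_rgb  # additive combination (assumption)
    gate = np.tanh(params["gate"])          # tanh(0) = 0, so the gate starts closed
    return rgb + gate * (fused @ params["w_out"])
```

With the gate at zero the function returns the RGB tokens unchanged, which is one simple way to keep a frozen backbone's inputs intact at the start of training. Opening the gate lets thermal evidence shift the tokens while their shape stays the same.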
