IRGPT: Understanding Real-world Infrared Image with Bi-cross-modal Curriculum on Large-scale Benchmark

Zhe Cao; Jin Zhang; Ruiheng Zhang

arXiv:2507.14449 · cs.CV · July 22, 2025

TL;DR

IRGPT is a multi-modal large language model for real-world infrared images. It combines a large-scale dataset of authentic infrared image-text pairs with a curriculum transfer learning strategy, and it outperforms existing methods across multiple tasks.

Contribution

The paper introduces IRGPT, the first multi-modal large language model for infrared images, built on a large-scale authentic dataset and a novel curriculum transfer learning approach.

Findings

01 Achieves state-of-the-art results on 9 infrared vision tasks.

02 Utilizes a large-scale dataset with over 260K image-text pairs.

03 Demonstrates effective knowledge transfer from visible to infrared domains.

Abstract

Real-world infrared imagery presents unique challenges for vision-language models due to the scarcity of aligned text data and domain-specific characteristics. Although existing methods have advanced the field, their reliance on synthetic infrared images generated through style transfer from visible images limits their ability to capture the unique characteristics of the infrared modality. To address this, we propose IRGPT, the first multi-modal large language model for real-world infrared images, built upon a large-scale InfraRed-Text Dataset (IR-TD) comprising over 260K authentic image-text pairs. IR-TD contains real infrared images paired with meticulously handcrafted texts, where the initial drafts originated from two complementary processes: (1) LLM-generated descriptions of visible images, and (2) rule-based descriptions of annotations. Furthermore, we…
