MSCoTDet: Language-driven Multi-modal Fusion for Improved Multispectral Pedestrian Detection
Taeheon Kim, Sangyun Chung, Damin Yeom, Youngjoon Yu, Hak Gu Kim, Yong, Man Ro

TL;DR
This paper introduces MSCoTDet, a novel framework that leverages Large Language Models and a chain-of-thought prompting strategy to mitigate modality bias and enhance multispectral pedestrian detection accuracy.
Contribution
It proposes a new language-driven multi-modal fusion method using LLMs to improve multispectral pedestrian detection by reducing modality bias.
Findings
MSCoTDet outperforms existing models in detection accuracy.
The LMF strategy effectively fuses language prompts with visual detection results.
Extensive experiments demonstrate bias mitigation and performance gains.
Abstract
Multispectral pedestrian detection is attractive for around-the-clock applications due to the complementary information between RGB and thermal modalities. However, current models often fail to detect pedestrians in certain cases (e.g., thermal-obscured pedestrians), particularly due to the modality bias learned from statistically biased datasets. In this paper, we investigate how to mitigate modality bias in multispectral pedestrian detection using Large Language Models (LLMs). Accordingly, we design a Multispectral Chain-of-Thought (MSCoT) prompting strategy, which prompts the LLM to perform multispectral pedestrian detection. Moreover, we propose a novel Multispectral Chain-of-Thought Detection (MSCoTDet) framework that integrates MSCoT prompting into multispectral pedestrian detection. To this end, we design a Language-driven Multi-modal Fusion (LMF) strategy that enables fusing the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsVideo Surveillance and Tracking Methods · Remote-Sensing Image Classification · Automated Road and Building Extraction
