MSCoTDet: Language-driven Multi-modal Fusion for Improved Multispectral Pedestrian Detection

Taeheon Kim; Sangyun Chung; Damin Yeom; Youngjoon Yu; Hak Gu Kim; Yong Man Ro

arXiv:2403.15209 · cs.CV · January 9, 2025 · 2 cites

Open Access

TL;DR

This paper introduces MSCoTDet, a novel framework that leverages Large Language Models and a chain-of-thought prompting strategy to mitigate modality bias and enhance multispectral pedestrian detection accuracy.

Contribution

It proposes a new language-driven multi-modal fusion method using LLMs to improve multispectral pedestrian detection by reducing modality bias.

Findings

01. MSCoTDet outperforms existing models in detection accuracy.

02. The Language-driven Multi-modal Fusion (LMF) strategy effectively fuses language prompts with visual detection results.

03. Extensive experiments demonstrate modality-bias mitigation and performance gains.

Abstract

Multispectral pedestrian detection is attractive for around-the-clock applications due to the complementary information between RGB and thermal modalities. However, current models often fail to detect pedestrians in certain cases (e.g., thermal-obscured pedestrians), particularly due to the modality bias learned from statistically biased datasets. In this paper, we investigate how to mitigate modality bias in multispectral pedestrian detection using Large Language Models (LLMs). Accordingly, we design a Multispectral Chain-of-Thought (MSCoT) prompting strategy, which prompts the LLM to perform multispectral pedestrian detection. Moreover, we propose a novel Multispectral Chain-of-Thought Detection (MSCoTDet) framework that integrates MSCoT prompting into multispectral pedestrian detection. To this end, we design a Language-driven Multi-modal Fusion (LMF) strategy that enables fusing the…
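To make the idea of chain-of-thought prompting over two modalities more concrete, here is a minimal sketch of how such a prompt might be assembled from per-modality text descriptions. Everything in it is an assumption made for illustration: the `ModalityDescription` structure, the `build_mscot_prompt` function, the step wording, and the example descriptions are hypothetical and are not taken from the paper, whose actual prompts and fusion procedure are not given in the truncated abstract above.

```python
from dataclasses import dataclass


@dataclass
class ModalityDescription:
    """Text description of one candidate region in one modality (hypothetical structure)."""
    modality: str      # "RGB" or "thermal"
    description: str   # free-text description; how it is produced is not specified here


def build_mscot_prompt(rgb: ModalityDescription, thermal: ModalityDescription) -> str:
    """Assemble a step-by-step prompt. The wording is illustrative, not the paper's."""
    steps = [
        f"Step 1. Consider the {rgb.modality} description: {rgb.description}",
        f"Step 2. Consider the {thermal.modality} description: {thermal.description}",
        "Step 3. Note where the two descriptions agree or disagree, and why one "
        "modality might be unreliable (for example, a pedestrian whose thermal "
        "signature is obscured).",
        "Step 4. Decide whether the region contains a pedestrian. "
        "Answer 'yes' or 'no' and give a confidence between 0 and 1.",
    ]
    # One reasoning step per line, in order.
    return "\n".join(steps)


prompt = build_mscot_prompt(
    ModalityDescription("RGB", "a person in a heavy coat crossing the street"),
    ModalityDescription("thermal", "no distinct warm silhouette is visible"),
)
print(prompt)
```

The sketch keeps each modality's evidence in a separate step before asking for a joint decision, so that a weak thermal cue does not silently override a clear RGB cue. Sending the prompt to an LLM and combining its answer with detector scores are left out, since the source does not describe them here.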

Taxonomy

Topics: Video Surveillance and Tracking Methods · Remote-Sensing Image Classification · Automated Road and Building Extraction