TL;DR
This paper introduces a novel semantic bridge fusion framework for multispectral object detection that leverages text semantics to better align RGB and IR modalities, addressing granularity and discrepancy issues.
Contribution
It proposes a bi-support modeling approach using text as a shared semantic bridge and introduces a structured fusion method that incorporates consensus and discrepancy supports.
Findings
Reports detection performance superior to prior methods on multispectral benchmarks.
Effectively aligns RGB and IR responses using text-guided semantic mapping.
Demonstrates the benefits of modeling cross-modal discrepancies in fusion.
Abstract
Text-guided multispectral object detection uses text semantics to guide semantic-aware cross-modal interaction between RGB and IR for more robust perception. However, notable limitations remain: (1) existing methods often use text only as an auxiliary semantic enhancement signal, without exploiting its guiding role to bridge the inherent granularity asymmetry between RGB and IR; and (2) conventional data-driven attention-based fusion tends to emphasize stable consensus while overlooking potentially valuable cross-modal discrepancies. To address these issues, we propose a semantic bridge fusion framework with bi-support modeling for multispectral object detection. Specifically, text is used as a shared semantic bridge to align RGB and IR responses under a unified category condition, while the recalibrated thermal semantic prior is projected onto the RGB branch for semantic-level mapping…
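The idea of scoring both modalities against one shared category text embedding, then separating where they agree from where they differ, can be illustrated with a small sketch. This is not the paper's implementation: the function names (`text_response`, `bi_support`, `fuse`), the cosine-similarity response, the `min` and absolute-difference definitions of consensus and discrepancy, and the response-weighted fusion rule are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def text_response(features, text_embedding):
    """Per-location response of one modality to the category text, clipped to [0, 1]."""
    return [max(0.0, cosine(f, text_embedding)) for f in features]


def bi_support(rgb_resp, ir_resp):
    """Assumed definitions: consensus is where both modalities respond,
    discrepancy is how far their responses differ."""
    consensus = [min(a, b) for a, b in zip(rgb_resp, ir_resp)]
    discrepancy = [abs(a - b) for a, b in zip(rgb_resp, ir_resp)]
    return consensus, discrepancy


def fuse(rgb_feats, ir_feats, text_embedding):
    """Hypothetical fusion rule: weight each modality by its text response,
    falling back to an even split when neither modality responds."""
    rgb_resp = text_response(rgb_feats, text_embedding)
    ir_resp = text_response(ir_feats, text_embedding)
    consensus, discrepancy = bi_support(rgb_resp, ir_resp)
    fused = []
    for f_rgb, f_ir, a, b in zip(rgb_feats, ir_feats, rgb_resp, ir_resp):
        total = a + b
        w_rgb = a / total if total > 0 else 0.5
        fused.append([w_rgb * x + (1.0 - w_rgb) * y for x, y in zip(f_rgb, f_ir)])
    return fused, consensus, discrepancy


# Two locations with 2-D toy features; the text embedding points along axis 0.
text = [1.0, 0.0]
rgb = [[1.0, 0.0], [0.0, 1.0]]  # RGB matches the category only at location 0
ir = [[1.0, 0.0], [1.0, 0.0]]   # IR matches the category at both locations
fused, consensus, discrepancy = fuse(rgb, ir, text)
```

In this toy input the modalities agree at the first location (high consensus, zero discrepancy) and disagree at the second, where only IR responds. A purely consensus-driven fusion would suppress the second location, whereas keeping the discrepancy term preserves the evidence that one modality saw something the other missed.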
