Mixture of Scale Experts for Alignment-free RGBT Video Object Detection and A Unified Benchmark
Qishun Wang, Zhengzheng Tu, Kunpeng Wang, Le Gu, Chuanwang Guo

TL;DR
This paper introduces MSENet, an alignment-free RGBT video object detection framework that combines multi-scale experts, dynamic routing, and deformable convolution to handle scale and spatial discrepancies between RGB and thermal frames without manual alignment. It is accompanied by a new, diverse benchmark dataset.
Contribution
The paper proposes MSENet, a scale-aware, alignment-free detection network, together with a new benchmark dataset for RGBT video object detection.
Findings
MSENet effectively captures scale discrepancies without explicit alignment.
Deformable convolution mitigates spatial misalignment issues.
The new dataset provides a comprehensive platform for evaluation.
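To make the deformable-convolution finding concrete, here is a minimal NumPy sketch of the general technique, not MSENet's actual layer. It assumes a single-channel feature map, a 3x3 kernel, replicate-style border handling, and offsets supplied by hand; in a real network the offsets would be predicted by a learned branch. All function names are hypothetical.

```python
import numpy as np

def bilinear_sample(feat, ys, xs):
    """Sample a 2D map at fractional (ys, xs); positions are clamped to the border."""
    h, w = feat.shape
    ys = np.clip(ys, 0, h - 1)
    xs = np.clip(xs, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = ys - y0
    wx = xs - x0
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bottom = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bottom

def deformable_conv3x3(feat, kernel, offsets):
    """feat: (H, W); kernel: (3, 3); offsets: (H, W, 9, 2) holding (dy, dx) per kernel tap."""
    h, w = feat.shape
    grid_y, grid_x = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    out = np.zeros((h, w))
    taps = [(ky, kx) for ky in (-1, 0, 1) for kx in (-1, 0, 1)]
    for k, (ky, kx) in enumerate(taps):
        # Regular grid position of this tap, shifted by its per-pixel offset.
        ys = grid_y + ky + offsets[..., k, 0]
        xs = grid_x + kx + offsets[..., k, 1]
        out += kernel[ky + 1, kx + 1] * bilinear_sample(feat, ys, xs)
    return out

# Toy check: an identity kernel with a constant dx = 2 offset shifts the map by two columns,
# which is how an offset field can compensate for a horizontal misalignment.
feat = np.arange(36, dtype=float).reshape(6, 6)
identity = np.zeros((3, 3))
identity[1, 1] = 1.0
no_shift = deformable_conv3x3(feat, identity, np.zeros((6, 6, 9, 2)))
shift = np.zeros((6, 6, 9, 2))
shift[..., 1] = 2.0
shifted = deformable_conv3x3(feat, identity, shift)
```

With zero offsets the operation reduces to an ordinary 3x3 convolution (here, the identity), so the offsets are the only part that adapts the sampling grid to the misalignment.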
Abstract
Existing RGB-Thermal Video Object Detection (RGBT VOD) methods predominantly rely on the manual alignment of image pairs, which is both labor-intensive and time-consuming. This dependency significantly restricts the scalability and practical applicability of these methods in real-world scenarios. To address this critical limitation, we propose a novel framework termed the Mixture of Scale Experts Network (MSENet). MSENet integrates multiple experts trained at different perceptual scales, enabling the capture of scale discrepancies between RGB and thermal image pairs without the need for explicit alignment. Specifically, to address the issue of unaligned scales, MSENet introduces a set of experts designed to perceive the correlation between RGBT image pairs across various scales. These experts are capable of identifying and quantifying the scale differences inherent in the image pairs.…
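The idea of scale experts that score the RGB-thermal correlation at several scales, with a router weighting them, can be sketched in a few lines of NumPy. This is an illustrative toy under stated assumptions, not the authors' architecture: each "expert" is simply a fixed rescaling of the thermal map, the score is a Pearson correlation, and the router is a temperature-scaled softmax. The function names, the scale set, and the temperature are all hypothetical.

```python
import numpy as np

def rescale_about_centre(feat, scale):
    """Zoom a 2D map about its centre by `scale` (nearest neighbour), keeping its shape."""
    h, w = feat.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys = np.clip(np.rint((np.arange(h) - cy) / scale + cy), 0, h - 1).astype(int)
    xs = np.clip(np.rint((np.arange(w) - cx) / scale + cx), 0, w - 1).astype(int)
    return feat[np.ix_(ys, xs)]

def correlation(a, b):
    """Pearson correlation between two maps of equal shape."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def mixture_of_scale_experts(rgb, thermal, scales=(0.5, 1.0, 2.0), temperature=0.1):
    """Weight each scale hypothesis by how well the rescaled thermal map matches RGB."""
    candidates = [rescale_about_centre(thermal, s) for s in scales]
    scores = np.array([correlation(rgb, c) for c in candidates])
    weights = softmax(scores / temperature)  # soft routing over the scale experts
    fused = sum(w * c for w, c in zip(weights, candidates))
    return weights, fused

def gaussian_blob(size, sigma):
    """Synthetic stand-in for an object response centred in the frame."""
    axis = np.arange(size) - (size - 1) / 2.0
    yy, xx = np.meshgrid(axis, axis, indexing="ij")
    return np.exp(-(yy**2 + xx**2) / (2 * sigma**2))

# Toy pair: the same object appears half as large in the thermal frame.
rgb = gaussian_blob(32, sigma=4.0)
thermal = gaussian_blob(32, sigma=2.0)
weights, fused = mixture_of_scale_experts(rgb, thermal)
```

On this synthetic pair the router places most of its weight on the 2.0 expert, since zooming the thermal blob by a factor of two best matches the RGB blob. The routing weights thus act as a soft estimate of the scale discrepancy without any explicit alignment step.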
Taxonomy
Topics: Video Surveillance and Tracking Methods · Face and Expression Recognition · Advanced Neural Network Applications
Methods: Deformable Convolution · Convolution · Sparse Evolutionary Training
