Mixture of Scale Experts for Alignment-free RGBT Video Object Detection and A Unified Benchmark

Qishun Wang; Zhengzheng Tu; Kunpeng Wang; Le Gu; Chuanwang Guo

arXiv:2410.12143·cs.CV·April 21, 2025

Open Access

TL;DR

This paper introduces MSENet, a novel alignment-free RGBT video object detection framework that leverages multi-scale experts, dynamic routing, and deformable convolution to handle scale and spatial discrepancies without manual alignment, supported by a new diverse benchmark dataset.

Contribution

The paper proposes MSENet, a scale-aware, alignment-free detection network with a new benchmark dataset for RGBT video object detection.

Findings

01. MSENet effectively captures scale discrepancies without explicit alignment.

02. Deformable convolution mitigates spatial misalignment issues.

03. The new dataset provides a comprehensive platform for evaluation.

Abstract

Existing RGB-Thermal Video Object Detection (RGBT VOD) methods predominantly rely on the manual alignment of image pairs, which is both labor-intensive and time-consuming. This dependency significantly restricts the scalability and practical applicability of these methods in real-world scenarios. To address this limitation, we propose a novel framework termed the Mixture of Scale Experts Network (MSENet). MSENet integrates multiple experts trained at different perceptual scales, enabling it to capture scale discrepancies between RGB and thermal image pairs without explicit alignment. Specifically, to address the issue of unaligned scales, MSENet introduces a set of experts designed to perceive the correlation between RGBT image pairs across various scales. These experts are capable of identifying and quantifying the scale differences inherent in the image pairs.…
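The idea of scale experts scoring the correlation between an RGB and a thermal input can be illustrated with a toy example. The sketch below is not MSENet: it works on 1D signals, uses fixed (untrained) experts, and picks cosine similarity and a softmax router as stand-ins. All function names, the `temperature` parameter, and the candidate scales are assumptions made for illustration only.

```python
import math

def sample(signal, x):
    """Linearly interpolate `signal` at fractional index x; zero outside its range."""
    if x < 0 or x > len(signal) - 1:
        return 0.0
    lo = int(math.floor(x))
    hi = min(lo + 1, len(signal) - 1)
    frac = x - lo
    return (1 - frac) * signal[lo] + frac * signal[hi]

def zoom(signal, scale):
    """Zoom `signal` about its center by `scale` (values > 1 magnify)."""
    c = (len(signal) - 1) / 2
    return [sample(signal, c + (i - c) / scale) for i in range(len(signal))]

def cosine(a, b):
    """Cosine similarity of two equal-length sequences (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def softmax(scores, temperature):
    """Numerically stable softmax over `scores` with a temperature."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_scale_experts(rgb, thermal, expert_scales, temperature=0.05):
    """Hypothetical router: each expert assumes the thermal signal is the RGB
    signal zoomed by its own scale, undoes that zoom, and scores the agreement.
    Returns the routing weights and a weighted scale estimate."""
    scores = [cosine(rgb, zoom(thermal, 1.0 / s)) for s in expert_scales]
    weights = softmax(scores, temperature)
    estimate = sum(w * s for w, s in zip(weights, expert_scales))
    return weights, estimate

# Toy data: a Gaussian bump as the "RGB" signal, and a 1.5x zoomed copy as "thermal".
rgb = [math.exp(-((i - 32) / 6.0) ** 2) for i in range(65)]
thermal = zoom(rgb, 1.5)
weights, estimate = route_scale_experts(rgb, thermal, [0.5, 1.0, 1.5, 2.0])
print("routing weights:", [round(w, 3) for w in weights])
print("estimated scale:", round(estimate, 3))
```

In this toy setting the expert whose scale matches the true zoom receives the largest routing weight. A lower `temperature` concentrates the weight on that expert, while a higher one blends neighboring scales into the estimate.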

Taxonomy

Topics: Video Surveillance and Tracking Methods · Face and Expression Recognition · Advanced Neural Network Applications

Methods: Deformable Convolution · Convolution · Sparse Evolutionary Training