SLGNet: Synergizing Structural Priors and Language-Guided Modulation for Multimodal Object Detection

Xiantai Xiang; Guangyao Zhou; Zixiao Wen; Wenshuai Li; Ben Niu; Feng Wang; Lijia Huang; Qiantong Wang; Yuhan Liu; Zongxu Pan; Yuxin Hu

arXiv:2601.02249 · cs.CV · January 6, 2026

Open Access

TL;DR

SLGNet is a multimodal object detection framework for RGB and infrared imagery that combines hierarchical structural priors with language-guided modulation on a frozen ViT backbone, aiming for robust, parameter-efficient detection in difficult conditions such as high-contrast or nighttime scenes.

Contribution

The paper proposes SLGNet, a parameter-efficient model that integrates hierarchical structural priors and language-driven feature modulation to improve multimodal detection performance.

Findings

01. Achieves state-of-the-art results on multiple datasets.

02. Reduces trainable parameters by approximately 87%.

03. Enhances robustness in complex, dynamic scenes.

Abstract

Multimodal object detection leveraging RGB and Infrared (IR) images is pivotal for robust perception in all-weather scenarios. While recent adapter-based approaches efficiently transfer RGB-pretrained foundation models to this task, they often prioritize model efficiency at the expense of cross-modal structural consistency. Consequently, critical structural cues are frequently lost when significant domain gaps arise, such as in high-contrast or nighttime environments. Moreover, conventional static multimodal fusion mechanisms typically lack environmental awareness, resulting in suboptimal adaptation and constrained detection performance under complex, dynamic scene variations. To address these limitations, we propose SLGNet, a parameter-efficient framework that synergizes hierarchical structural priors and language-guided modulation within a frozen Vision Transformer (ViT)-based…

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning