Semantic-Guided Natural Language and Visual Fusion for Cross-Modal Interaction Based on Tiny Object Detection

Xian-Hong Huang; Hui-Kai Su; Chi-Chia Sun; Jun-Wei Hsieh

arXiv:2511.05474·cs.CV·November 10, 2025

Open Access

TL;DR

This paper presents a cross-modal tiny object detection approach that combines semantic-guided natural language processing (BERT) with CNN-based visual backbones, reporting 52.6% AP on the COCO2017 validation set with fewer parameters than Transformer-based models.

Contribution

It introduces a fusion method that integrates the BERT language model with the CNN-based PRB-FPN-Net, using ELAN, MSP, and CSP backbones, to improve tiny object detection performance.

Findings

01. Achieves 52.6% AP on the COCO2017 validation set.

02. Outperforms YOLO-World while using fewer parameters than Transformer-based models.

03. Demonstrates robustness across multiple backbone architectures.

Abstract

This paper introduces a cutting-edge approach to cross-modal interaction for tiny object detection by combining semantic-guided natural language processing with advanced visual recognition backbones. The proposed method integrates the BERT language model with the CNN-based Parallel Residual Bi-Fusion Feature Pyramid Network (PRB-FPN-Net), incorporating innovative backbone architectures such as ELAN, MSP, and CSP to optimize feature extraction and fusion. By employing lemmatization and fine-tuning techniques, the system aligns semantic cues from textual inputs with visual features, enhancing detection precision for small and complex objects. Experimental validation using the COCO and Objects365 datasets demonstrates that the model achieves superior performance. On the COCO2017 validation set, it attains a 52.6% average precision (AP), outperforming YOLO-World significantly while…
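The abstract does not spell out how textual cues are aligned with visual features, so the following is only a minimal sketch of one common pattern in open-vocabulary detection: normalize class prompts by lemmatization, then score each region feature against each text embedding by cosine similarity. Everything here is an assumption for illustration: the `TOY_LEMMAS` table, the function names, and the hand-written vectors, which stand in for BERT text embeddings and backbone region features. It is not the authors' implementation.

```python
import math

# Hypothetical toy lemma table (assumption); a real system would use a full lemmatizer.
TOY_LEMMAS = {"birds": "bird", "kites": "kite", "people": "person"}


def lemmatize_prompt(prompt):
    """Lowercase a prompt and map each word to its lemma via the toy table."""
    return " ".join(TOY_LEMMAS.get(w, w) for w in prompt.lower().split())


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_regions(region_feats, text_embeds):
    """For each region feature, return the (label, score) pair with the highest similarity."""
    results = []
    for feat in region_feats:
        label, score = max(
            ((name, cosine(feat, emb)) for name, emb in text_embeds.items()),
            key=lambda pair: pair[1],
        )
        results.append((label, score))
    return results


# Stand-in vectors (assumption): in practice the text embeddings would come from
# a language model and the region features from the visual backbone.
text_embeds = {
    lemmatize_prompt("Birds"): [1.0, 0.0, 0.0],
    lemmatize_prompt("Kites"): [0.0, 1.0, 0.0],
}
region_feats = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
matches = match_regions(region_feats, text_embeds)
```

Lemmatizing prompts before embedding means that "Birds" and "bird" resolve to the same text key, so a detector queried with either form scores regions against a single embedding.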

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Technologies in Various Fields