Semantic-Guided Natural Language and Visual Fusion for Cross-Modal Interaction Based on Tiny Object Detection
Xian-Hong Huang, Hui-Kai Su, Chi-Chia Sun, Jun-Wei Hsieh

TL;DR
This paper presents a novel cross-modal tiny object detection approach that combines semantic-guided natural language processing with advanced visual backbones, achieving superior accuracy and efficiency on standard datasets.
Contribution
It introduces a new fusion method integrating BERT with CNN-based backbones like ELAN, MSP, and CSP for improved tiny object detection performance.
Findings
Achieves 52.6% AP on COCO2017 validation set.
Outperforms YOLO-World while using fewer parameters than Transformer-based models.
Demonstrates robustness across multiple backbone architectures.
Abstract
This paper introduces a cutting-edge approach to cross-modal interaction for tiny object detection by combining semantic-guided natural language processing with advanced visual recognition backbones. The proposed method integrates the BERT language model with the CNN-based Parallel Residual Bi-Fusion Feature Pyramid Network (PRB-FPN-Net), incorporating innovative backbone architectures such as ELAN, MSP, and CSP to optimize feature extraction and fusion. By employing lemmatization and fine-tuning techniques, the system aligns semantic cues from textual inputs with visual features, enhancing detection precision for small and complex objects. Experimental validation using the COCO and Objects365 datasets demonstrates that the model achieves superior performance. On the COCO2017 validation set, it attains a 52.6% average precision (AP), outperforming YOLO-World significantly while…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Technologies in Various Fields
