A Text-Guided Vision Model for Enhanced Recognition of Small Instances
Hyun-Ki Jung

TL;DR
This paper introduces a text-guided vision model, built on YOLO-World, that improves small-object detection in drone imagery by replacing the C2f layer in the YOLOv8 backbone with a C3k2 layer and by optimizing parallel processing for speed.
Contribution
The paper presents a lightweight, efficient model that improves small-object detection through architectural modifications and parallel-processing optimization, outperforming previous models on the VisDrone dataset.
Findings
Precision increased from 40.6% to 41.6%.
Model size reduced from 4 million to 3.8 million parameters.
FLOPs decreased from 15.7 billion to 15.2 billion.
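To put the three figures above on a common scale, the following minimal sketch computes the absolute and relative change for each one. The helper name `change` and the dictionary layout are illustrative assumptions, not part of the paper; only the before and after values are taken from the list.

```python
def change(before, after):
    """Return (absolute change, relative change in percent)."""
    delta = after - before
    return delta, 100.0 * delta / before

# Before/after values copied from the Findings list; the keys are
# hypothetical labels chosen for this illustration.
findings = {
    "precision (%)": (40.6, 41.6),
    "parameters (millions)": (4.0, 3.8),
    "FLOPs (billions)": (15.7, 15.2),
}

for name, (before, after) in findings.items():
    delta, rel = change(before, after)
    print(f"{name}: {delta:+.1f} absolute, {rel:+.1f}% relative")
```

Read this way, the precision gain is 1.0 percentage point (about 2.5% relative), while the parameter count falls by 5% and FLOPs by roughly 3.2%.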
Abstract
As drone-based object detection technology continues to evolve, the demand is shifting from merely detecting objects to enabling users to accurately identify specific targets. For example, users can input particular targets as prompts to precisely detect desired objects. To address this need, an efficient text-guided object detection model has been developed to enhance the detection of small objects. Specifically, an improved version of the existing YOLO-World model is introduced. The proposed method replaces the C2f layer in the YOLOv8 backbone with a C3k2 layer, enabling more precise representation of local features, particularly for small objects or those with clearly defined boundaries. Additionally, the proposed architecture improves processing speed and efficiency through parallel processing optimization, while also contributing to a more lightweight model design. Comparative…
Taxonomy
Topics: Advanced Neural Network Applications · UAV Applications and Optimization · Advanced Image and Video Retrieval Techniques
