VisText-Mosquito: A Unified Multimodal Dataset for Visual Detection, Segmentation, and Textual Explanation on Mosquito Breeding Sites
Md. Adnanul Islam, Md. Faiyaz Abdullah Sayeedi, Md. Asaduzzaman Shuvo, Shahanur Rahman Bappy, Md Asiful Islam, Swakkhar Shatabda

TL;DR
VisText-Mosquito is a comprehensive multimodal dataset and model framework for detecting, segmenting, and explaining mosquito breeding sites to aid in disease prevention.
Contribution
The paper introduces a new multimodal dataset and fine-tuned models for automated mosquito breeding site analysis using visual and textual data.
Findings
YOLOv9s achieved a detection precision of 0.92926 and mAP@50 of 0.92891
YOLOv11n-Seg reached a segmentation precision of 0.91587 and mAP@50 of 0.79795
Mosquito-LLaMA3-8B achieved a BLEU score of 54.7 in explanation generation
Abstract
Mosquito-borne diseases pose a major global health risk, requiring early detection and proactive control of breeding sites to prevent outbreaks. In this paper, we present VisText-Mosquito, a multimodal dataset that integrates visual and textual data to support automated detection, segmentation, and explanation for mosquito breeding site analysis. The dataset includes 1,828 annotated images for object detection, 142 images for water surface segmentation, and natural language explanation texts linked to each image. The YOLOv9s model achieves the highest precision of 0.92926 and mAP@50 of 0.92891 for object detection, while YOLOv11n-Seg reaches a segmentation precision of 0.91587 and mAP@50 of 0.79795. For textual explanation generation, we tested a range of large vision-language models (LVLMs) in both zero-shot and few-shot settings. Our fine-tuned Mosquito-LLaMA3-8B model achieved the…
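The precision and mAP@50 figures above both rest on matching predicted boxes to ground-truth boxes at an intersection-over-union (IoU) threshold of 0.5. The following is a minimal sketch of that matching step for a single image and a single class; it is not the authors' evaluation pipeline, and the names `iou` and `precision_at_iou` as well as the greedy one-to-one matching rule are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical box format for this sketch: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_at_iou(preds: List[Box], truths: List[Box], thr: float = 0.5) -> float:
    """Fraction of predictions that match a ground-truth box at IoU >= thr.

    Assumes `preds` is already sorted by descending confidence. Each
    ground-truth box can be matched by at most one prediction.
    """
    matched = set()  # indices of ground-truth boxes already claimed
    true_positives = 0
    for p in preds:
        best_j, best_iou = -1, 0.0
        for j, t in enumerate(truths):
            if j in matched:
                continue
            v = iou(p, t)
            if v > best_iou:
                best_j, best_iou = j, v
        if best_j >= 0 and best_iou >= thr:
            matched.add(best_j)
            true_positives += 1
    return true_positives / len(preds) if preds else 0.0


if __name__ == "__main__":
    truths = [(0.0, 0.0, 10.0, 10.0)]
    preds = [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
    # One of the two predictions overlaps a ground-truth box.
    print(precision_at_iou(preds, truths))
```

In this sketch a prediction counts as a true positive only if it claims a previously unmatched ground-truth box, so duplicate detections of the same object lower precision. mAP@50 additionally averages precision over recall levels and classes, which this example does not attempt.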
