PestVL-Net: Enabling Multimodal Pest Learning via Fine-grained Vision-Language Interaction

Xueheng Li; Tao Hu; Ke Cao; Runsheng Qi; Huixin Zhang; Rui Li; Jie Zhang; Chengjun Xie

arXiv:2604.17278 · cs.CV · April 21, 2026

TL;DR

PestVL-Net is a multimodal vision-language framework for fine-grained pest recognition. It draws on expert knowledge and pairs an RWKV-based visual pathway with language modeling, targeting practical agricultural pest management.

Contribution

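The paper introduces PestVL-Net, a vision-language model for pest identification. Its visual pathway uses the Recurrent Weighted Key Value (RWKV) architecture with a saliency-guided adaptive window partitioning scheme, and the model is combined with multimodal reasoning and two multi-species pest datasets.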

Findings

01. PestVL-Net outperforms existing methods on pest datasets.

02. The model effectively integrates visual features with semantic descriptions.

03. Experimental results demonstrate its potential for real-world pest management.

Abstract

Effective pest recognition and management are crucial for sustainable agricultural development. However, collecting pest data in real scenarios is often challenging. Compared to other domains, pests exhibit a wide variety of species with complex and diverse morphological characteristics. Existing techniques struggle to effectively model the key visual and high-level semantic features of pests in a fine-grained manner. These limitations hinder the practical application of such methods in real agricultural scenarios. To address these critical challenges, we present a synergistic approach that integrates PestVL-Net, a novel vision-language framework, with two multi-species pest datasets to facilitate fine-grained pest learning. The visual pathway of PestVL-Net utilizes the Recurrent Weighted Key Value (RWKV) architecture, incorporating a saliency-guided adaptive window partitioning scheme…
