CT-CLIP: A Multi-modal Fusion Framework for Robust Apple Leaf Disease Recognition in Complex Environments
Lemin Liu, Fangchao Hu, Honghua Jiang, Yaru Chen, Limin Liu, Yongliang Qiao

TL;DR
This paper introduces CT-CLIP, a multi-modal framework that combines a CNN, a Vision Transformer, and CLIP to recognize apple leaf diseases in complex orchard environments. It reports over 97% accuracy on apple disease datasets and targets the wide variation in lesion appearance.
Contribution
It proposes a novel multi-branch framework with adaptive feature fusion and multimodal image-text learning, improving recognition accuracy under complex backgrounds and few-shot conditions.
Findings
Achieves over 97% accuracy on apple disease datasets.
Effectively fuses local and global features for better recognition.
Outperforms baseline methods in complex environments.
Abstract
In complex orchard environments, the phenotypic heterogeneity of different apple leaf diseases, characterized by significant variation among lesions, poses a challenge to traditional multi-scale feature fusion methods. These methods only integrate multi-layer features extracted by convolutional neural networks (CNNs) and fail to adequately account for the relationships between local and global features. Therefore, this study proposes a multi-branch recognition framework named CNN-Transformer-CLIP (CT-CLIP). The framework synergistically employs a CNN to extract local lesion detail features and a Vision Transformer to capture global structural relationships. An Adaptive Feature Fusion Module (AFFM) then dynamically fuses these features, achieving optimal coupling of local and global information and effectively addressing the diversity in lesion morphology and distribution. Additionally,…
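To make the idea of dynamically fusing local and global features concrete, here is a minimal sketch of one common approach, a per-channel sigmoid gate that blends two feature vectors. It is an illustration only: the abstract excerpt above does not specify how the AFFM is built, so the function name `adaptive_fuse`, the gate form, and the parameters `w_local`, `w_global`, and `bias` are all assumptions rather than the authors' design.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_fuse(local_feat, global_feat, w_local, w_global, bias):
    """Blend two equal-length feature vectors with a per-channel gate.

    Hypothetical stand-in for an adaptive fusion step, not the paper's AFFM.
    local_feat  : features from a CNN branch (local lesion detail)
    global_feat : features from a Transformer branch (global structure)
    w_local, w_global, bias : gate parameters (would be learned in practice)
    Returns (fused, gates), where a gate near 1 favors the local feature
    and a gate near 0 favors the global feature.
    """
    lengths = {len(local_feat), len(global_feat),
               len(w_local), len(w_global), len(bias)}
    if len(lengths) != 1:
        raise ValueError("all inputs must have the same length")

    fused, gates = [], []
    for loc, glo, wl, wg, b in zip(local_feat, global_feat,
                                   w_local, w_global, bias):
        # The gate depends on both inputs, so the mix changes per sample.
        gate = sigmoid(wl * loc + wg * glo + b)
        gates.append(gate)
        fused.append(gate * loc + (1.0 - gate) * glo)
    return fused, gates


if __name__ == "__main__":
    # Toy two-channel example with hand-picked gate parameters.
    fused, gates = adaptive_fuse(
        local_feat=[1.0, 0.0],
        global_feat=[0.0, 2.0],
        w_local=[0.0, 0.0],
        w_global=[0.0, 0.0],
        bias=[0.0, 0.0],
    )
    print(fused, gates)
```

With all gate parameters at zero, every gate equals 0.5 and the output is the plain average of the two branches. Learned parameters would shift the balance per channel and per input, which is the sense in which such a fusion is "adaptive".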
