Tile-Based ViT Inference with Visual-Cluster Priors for Zero-Shot Multi-Species Plant Identification
Murilo Gustineli, Anthony Miyaguchi, Adrian Cheung, Divyansh Khattak

TL;DR
This paper presents a tile-based Vision Transformer inference pipeline with visual-cluster priors for zero-shot multi-species plant identification in vegetation quadrat images. It reaches a macro-averaged F1 of 0.348 on the PlantCLEF 2025 private leaderboard without any additional training.
Contribution
It combines a 4x4 tiling strategy with PaCMAP + K-Means visual clustering, geolocation filtering, and cluster-specific Bayesian priors to adapt a fine-tuned Vision Transformer to quadrat images at inference time.
Findings
Achieved a macro-averaged F1 of 0.348 on the PlantCLEF 2025 private leaderboard (second place)
Used a 4x4 tiling strategy that aligns tile size with the network's 518x518 receptive field
Required no additional training
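The 4x4 tiling step can be illustrated with a minimal sketch. The function name `split_into_tiles` and the example image size are assumptions made for illustration and are not taken from the authors' repository; resizing each tile to the model's input size is left out.

```python
import numpy as np


def split_into_tiles(image: np.ndarray, grid: int = 4) -> list[np.ndarray]:
    """Split an H x W x C image into grid x grid non-overlapping tiles.

    Hypothetical helper: tile edges are spread evenly over the image, so
    tiles may differ by one pixel when H or W is not divisible by `grid`.
    """
    height, width = image.shape[:2]
    row_edges = np.linspace(0, height, grid + 1).astype(int)
    col_edges = np.linspace(0, width, grid + 1).astype(int)

    tiles = []
    for r in range(grid):
        for c in range(grid):
            tile = image[row_edges[r]:row_edges[r + 1],
                         col_edges[c]:col_edges[c + 1]]
            tiles.append(tile)
    return tiles


# Assumed example: a quadrat image whose side is exactly 4 * 518 pixels,
# so every tile matches the 518x518 size mentioned in the findings.
quadrat = np.zeros((2072, 2072, 3), dtype=np.uint8)
tiles = split_into_tiles(quadrat)
```

In this sketch each of the 16 tiles would then be passed to the classifier independently; real quadrat images of other sizes would need a resize step before inference.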
Abstract
We describe DS@GT's second-place solution to the PlantCLEF 2025 challenge on multi-species plant identification in vegetation quadrat images. Our pipeline combines (i) a fine-tuned Vision Transformer ViTD2PC24All for patch-level inference, (ii) a 4x4 tiling strategy that aligns patch size with the network's 518x518 receptive field, and (iii) domain-prior adaptation through PaCMAP + K-Means visual clustering and geolocation filtering. Tile predictions are aggregated by majority vote and re-weighted with cluster-specific Bayesian priors, yielding a macro-averaged F1 of 0.348 (private leaderboard) while requiring no additional training. All code, configuration files, and reproducibility scripts are publicly available at https://github.com/dsgt-arc/plantclef-2025.
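The aggregation and re-weighting step described above can be sketched as follows. The abstract does not give the exact formulas, so everything here is an assumption for illustration: each tile contributes a list of predicted species, a species' vote is the number of tiles that predicted it, and the final score is the vote multiplied by a per-cluster prior and normalized. The names `vote_over_tiles` and `reweight_with_prior`, the `epsilon` fallback, and the toy species labels are all hypothetical.

```python
from collections import Counter


def vote_over_tiles(tile_predictions: list[list[str]]) -> Counter:
    """Count, for each species, how many tiles predicted it."""
    votes = Counter()
    for species_in_tile in tile_predictions:
        # A species counts at most once per tile.
        votes.update(set(species_in_tile))
    return votes


def reweight_with_prior(votes: Counter,
                        cluster_prior: dict[str, float],
                        epsilon: float = 1e-6) -> dict[str, float]:
    """Multiply tile votes by a cluster-specific prior and normalize.

    Assumption: the vote count acts as an unnormalized likelihood, so the
    result is proportional to votes * prior. Species missing from the
    prior receive a small `epsilon` instead of zero.
    """
    scores = {s: n * cluster_prior.get(s, epsilon) for s, n in votes.items()}
    total = sum(scores.values())
    return {s: v / total for s, v in scores.items()}


# Toy example with made-up labels: four tiles, three candidate species.
tile_predictions = [["A", "B"], ["A"], ["B", "C"], ["A"]]
cluster_prior = {"A": 0.5, "B": 0.25, "C": 0.25}

votes = vote_over_tiles(tile_predictions)
posterior = reweight_with_prior(votes, cluster_prior)
```

In this toy case species "A" receives three votes and the largest prior, so it ends up with the highest re-weighted score. The actual pipeline derives its priors from PaCMAP + K-Means clusters and geolocation filtering, which this sketch does not model.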
Taxonomy
Topics: Smart Agriculture and AI · Remote Sensing in Agriculture · Advanced Neural Network Applications
Methods: Dropout · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Softmax · Transformer · Layer Normalization · Dense Connections · Vision Transformer
