Multi-Label Plant Species Prediction with Metadata-Enhanced Multi-Head Vision Transformers
Hanna Herasimchyk, Robin Labryga, Tomislav Prusina

TL;DR
This paper introduces a multi-head vision transformer with metadata integration for multi-label plant species prediction. The model is designed to handle the domain shift from single-species training images to multi-species test images, and it placed 3rd on the private leaderboard of the PlantCLEF 2025 challenge.
Contribution
It proposes multi-scale tiling, dynamic thresholding, and ensemble strategies within a vision transformer framework to improve multi-label plant classification.
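To make the idea of multi-scale tiling concrete, here is a minimal sketch that enumerates tile boxes for a quadrat image at several grid resolutions. The grid sizes (1, 2, 4), the non-overlapping layout, and the function names `grid_tiles` and `multi_scale_tiles` are illustrative assumptions; this summary does not state which tile sizes or overlaps the authors use.

```python
from typing import List, Sequence, Tuple

# A tile is a pixel box: (left, top, right, bottom).
Box = Tuple[int, int, int, int]


def grid_tiles(width: int, height: int, n: int) -> List[Box]:
    """Split a width x height image into an n x n grid of non-overlapping tiles."""
    boxes: List[Box] = []
    for row in range(n):
        for col in range(n):
            left = col * width // n
            top = row * height // n
            right = (col + 1) * width // n
            bottom = (row + 1) * height // n
            boxes.append((left, top, right, bottom))
    return boxes


def multi_scale_tiles(
    width: int, height: int, scales: Sequence[int] = (1, 2, 4)
) -> List[Box]:
    """Collect tiles from every grid size in `scales` (hypothetical default)."""
    tiles: List[Box] = []
    for n in scales:
        tiles.extend(grid_tiles(width, height, n))
    return tiles


if __name__ == "__main__":
    for box in multi_scale_tiles(1000, 800, scales=(1, 2)):
        print(box)
```

Each box could then be cropped, resized to the backbone's input size, and classified separately, with per-tile predictions aggregated per image. How the aggregation is done is not specified here.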
Findings
Placed 3rd on the private leaderboard.
Handled the domain shift from single-species training images to multi-species quadrat images.
Trained on large-scale data comprising over 1.4 million images.
Abstract
We present a multi-head vision transformer approach for multi-label plant species prediction in vegetation plot images, addressing the PlantCLEF 2025 challenge. The task involves training models on single-species plant images while testing on multi-species quadrat images, creating a drastic domain shift. Our methodology leverages a pre-trained DINOv2 Vision Transformer Base (ViT-B/14) backbone with multiple classification heads for species, genus, and family prediction, utilizing taxonomic hierarchies. Key contributions include multi-scale tiling to capture plants at different scales, dynamic threshold optimization based on mean prediction length, and ensemble strategies through bagging and Hydra model architectures. The approach incorporates various inference techniques including image cropping to remove non-plant artifacts, top-n filtering for prediction constraints, and logit…
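As a rough illustration of choosing a threshold from the mean prediction length, the sketch below searches for a score threshold at which the average number of predicted species per image does not exceed a target value. The target value, the bisection search, and the names `mean_prediction_length` and `fit_threshold` are assumptions made for this example; the abstract does not describe the authors' exact optimization procedure.

```python
from typing import Sequence


def mean_prediction_length(
    scores: Sequence[Sequence[float]], threshold: float
) -> float:
    """Average number of species per image whose score is >= threshold."""
    total = sum(sum(1 for s in row if s >= threshold) for row in scores)
    return total / len(scores)


def fit_threshold(
    scores: Sequence[Sequence[float]],
    target_mean_length: float,
    iters: int = 30,
) -> float:
    """Bisect on [0, 1] for the lowest threshold whose mean prediction
    length is at most `target_mean_length` (a hypothetical target).

    Works because the mean length never increases as the threshold rises.
    """
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean_prediction_length(scores, mid) > target_mean_length:
            lo = mid  # too many species predicted: raise the threshold
        else:
            hi = mid  # within the target: try a lower threshold
    return hi


if __name__ == "__main__":
    # Toy per-image species probabilities (made-up values).
    toy_scores = [[0.9, 0.6, 0.1], [0.8, 0.3, 0.2]]
    t = fit_threshold(toy_scores, target_mean_length=1.5)
    print(t, mean_prediction_length(toy_scores, t))
```

The returned threshold can then be applied to every test image, so the number of predicted species adapts to the score distribution rather than relying on a fixed cutoff.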
