Multi-Label Plant Species Prediction with Metadata-Enhanced Multi-Head Vision Transformers

Hanna Herasimchyk; Robin Labryga; Tomislav Prusina

arXiv:2508.10457·cs.CV·August 15, 2025

TL;DR

This paper introduces a multi-head vision transformer with metadata integration for multi-label plant species prediction. The model is trained on single-species images, evaluated on multi-species quadrat images, and placed 3rd on the private leaderboard of the PlantCLEF 2025 challenge.

Contribution

It proposes multi-scale tiling, dynamic thresholding, and ensemble strategies within a vision transformer framework to improve multi-label plant classification.
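
As a rough illustration of multi-scale tiling, the sketch below cuts an image into non-overlapping n x n grids at several scales and merges per-tile class scores. The grid sizes `(1, 2, 3)`, the function names, and the per-class maximum used for merging are assumptions made for this example, not details confirmed by the paper.

```python
def tile_boxes(width, height, grids=(1, 2, 3)):
    """Return (left, top, right, bottom) boxes for every n x n grid in `grids`.

    The grid sizes are hypothetical; the paper's actual scales are not given here.
    """
    boxes = []
    for n in grids:
        for row in range(n):
            for col in range(n):
                left = col * width // n
                top = row * height // n
                right = (col + 1) * width // n
                bottom = (row + 1) * height // n
                boxes.append((left, top, right, bottom))
    return boxes


def max_pool_scores(tile_scores):
    """Merge per-tile class scores by taking the per-class maximum.

    Max pooling is an assumed aggregation rule chosen for illustration.
    """
    return [max(class_scores) for class_scores in zip(*tile_scores)]


boxes = tile_boxes(300, 300)
merged = max_pool_scores([[0.1, 0.9], [0.4, 0.2]])
```

Each box can be passed to an image-cropping routine before classification, so that small plants occupy a larger share of the model input than they do in the full quadrat image.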

Findings

01  Achieved 3rd place on the private leaderboard.

02  Effective handling of domain shift from single-species to multi-species images.

03  Utilized large-scale training data with over 1.4 million images.

Abstract

We present a multi-head vision transformer approach for multi-label plant species prediction in vegetation plot images, addressing the PlantCLEF 2025 challenge. The task involves training models on single-species plant images while testing on multi-species quadrat images, creating a drastic domain shift. Our methodology leverages a pre-trained DINOv2 Vision Transformer Base (ViT-B/14) backbone with multiple classification heads for species, genus, and family prediction, utilizing taxonomic hierarchies. Key contributions include multi-scale tiling to capture plants at different scales, dynamic threshold optimization based on mean prediction length, and ensemble strategies through bagging and Hydra model architectures. The approach incorporates various inference techniques including image cropping to remove non-plant artifacts, top-n filtering for prediction constraints, and logit…
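
One way to read "dynamic threshold optimization based on mean prediction length" is to pick the probability cutoff at which the average number of predicted species per image matches a chosen target. The sketch below does this with a bisection search. The function names, the target value, and the search procedure are assumptions for illustration and may differ from the authors' method.

```python
def mean_prediction_length(probs, threshold):
    """Average number of classes per image whose probability is >= threshold."""
    kept = sum(sum(1 for p in row if p >= threshold) for row in probs)
    return kept / len(probs)


def threshold_for_mean_length(probs, target_length, iterations=50):
    """Bisect for the smallest threshold whose mean prediction length
    does not exceed `target_length` (a hypothetical tuning parameter)."""
    low, high = 0.0, 1.0
    for _ in range(iterations):
        mid = (low + high) / 2
        if mean_prediction_length(probs, mid) > target_length:
            low = mid   # too many species predicted: raise the threshold
        else:
            high = mid  # at or below target: try a lower threshold
    return high


# Toy per-image class probabilities (made-up numbers).
probs = [[0.9, 0.6, 0.1],
         [0.8, 0.3, 0.2]]
threshold = threshold_for_mean_length(probs, target_length=1.5)
```

The search relies on the mean prediction length never increasing as the threshold rises, which holds for any fixed set of probabilities.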
