CLIP-Guided Multi-Task Regression for Multi-View Plant Phenotyping
Simon Warmers, Muhammad Zawish, Fayaz Ali Dharejo, Steven Davy, Radu Timofte

TL;DR
This paper introduces a CLIP-guided multi-task framework that jointly predicts plant age and leaf count from multi-view images, improving accuracy over the GroMo baseline and remaining stable when views are missing or unordered.
Contribution
It presents a level-aware vision-language model that aggregates multi-view images into angle-invariant features and conditions predictions on text priors encoding viewpoint level, so that a single multi-task model predicts both plant age and leaf count.
Findings
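Reduces mean age MAE from 7.74 to 3.91 on GroMo25 (49.5% lower than the GroMo baseline)
Reduces mean leaf-count MAE from 5.52 to 3.08 on GroMo25 (44.2% lower than the GroMo baseline)
Improves robustness to missing or unordered views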
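The relative improvements implied by these MAE values can be checked with a few lines of arithmetic. The sketch below uses only the four MAE figures reported for GroMo25; the function name `relative_reduction` is introduced here for illustration and does not come from the paper.

```python
def relative_reduction(baseline_mae, new_mae):
    """Percentage by which new_mae is lower than baseline_mae."""
    return 100.0 * (baseline_mae - new_mae) / baseline_mae


# MAE values as reported for the GroMo baseline and the proposed model.
age_gain = relative_reduction(7.74, 3.91)
leaf_gain = relative_reduction(5.52, 3.08)

print(f"age MAE reduction: {age_gain:.1f}%")          # 49.5%
print(f"leaf-count MAE reduction: {leaf_gain:.1f}%")  # 44.2%
```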
Abstract
Modeling plant growth dynamics plays a central role in modern agricultural research. However, learning robust predictors from multi-view plant imagery remains challenging due to strong viewpoint redundancy and viewpoint-dependent appearance changes. We propose a level-aware vision language framework that jointly predicts plant age and leaf count using a single multi-task model built on CLIP embeddings. Our method aggregates rotational views into angle-invariant representations and conditions visual features on lightweight text priors encoding viewpoint level for stable prediction under incomplete or unordered inputs. On the GroMo25 benchmark, our approach reduces mean age MAE from 7.74 to 3.91 and mean leaf-count MAE from 5.52 to 3.08 compared to the GroMo baseline, corresponding to improvements of 49.5% and 44.2%, respectively. The unified formulation simplifies the pipeline by…
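One way to picture the idea of order-independent view aggregation with a text prior is the NumPy sketch below. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the pooling or fusion mechanism, so mean pooling and concatenation are assumed here, random vectors stand in for CLIP image and text embeddings, and every name (`EMBED_DIM`, `aggregate_views`, `predict`, `W`) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 512  # assumed embedding width (hypothetical)


def aggregate_views(view_embeddings):
    """Mean-pool L2-normalised per-view embeddings.

    Mean pooling is an assumption; it does not depend on view order
    and accepts any number of views >= 1.
    """
    v = np.asarray(view_embeddings, dtype=np.float64)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    pooled = v.mean(axis=0)
    return pooled / np.linalg.norm(pooled)


def predict(view_embeddings, level_text_embedding, W):
    """Fuse pooled visual features with a viewpoint-level text prior
    (by concatenation, an assumption) and apply one shared linear map
    with two outputs: [age, leaf_count]."""
    pooled = aggregate_views(view_embeddings)
    prior = level_text_embedding / np.linalg.norm(level_text_embedding)
    fused = np.concatenate([pooled, prior])
    return W @ fused


# Random stand-ins: 24 rotational views, one level prior, untrained weights.
views = rng.normal(size=(24, EMBED_DIM))
level_prior = rng.normal(size=EMBED_DIM)
W = rng.normal(size=(2, 2 * EMBED_DIM)) * 0.01

full = predict(views, level_prior, W)
shuffled = predict(views[rng.permutation(24)], level_prior, W)
partial = predict(views[:6], level_prior, W)  # only 6 of 24 views available

print(np.allclose(full, shuffled))  # shuffling the views does not change the output
print(partial.shape)                # still two outputs with missing views
```

With untrained weights the outputs are meaningless as predictions; the point is only that the pooled representation is unaffected by view order and remains defined when some views are absent.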
Taxonomy
Topics: Smart Agriculture and AI · Remote Sensing in Agriculture · Greenhouse Technology and Climate Control
