Fine-Tuned Vision Transformers Capture Complex Wheat Spike Morphology for Volume Estimation from RGB Images
Olivia Zumsteg, Nico Graf, Aaron Haeusler, Norbert Kirchgessner, Nicola Storni, Lukas Roth, Andreas Hund

TL;DR
This study develops a deep learning-based method using fine-tuned Vision Transformers to accurately estimate wheat spike volume from RGB images, outperforming traditional geometric and CNN-based approaches in field conditions.
Contribution
The paper introduces a pipeline that uses fine-tuned Vision Transformers for non-destructive estimation of wheat spike volume from RGB images, with structured-light 3D scans serving as ground truth.
Findings
Fine-tuned Vision Transformers outperform CNNs and geometric baselines.
Deep-supervised LSTMs excel with frozen DINO backbones.
Object shape complexity affects geometric method accuracy.
Abstract
Estimating three-dimensional morphological traits such as volume from two-dimensional RGB images presents inherent challenges due to the loss of depth information, projection distortions, and occlusions under field conditions. In this work, we explore multiple approaches for non-destructive volume estimation of wheat spikes using RGB images and structured-light 3D scans as ground truth references. Wheat spike volume is promising for phenotyping as it shows high correlation with spike dry weight, a key component of fruiting efficiency. Accounting for the complex geometry of the spikes, we compare different neural network approaches for volume estimation from 2D images and benchmark them against two conventional baselines: a 2D area-based projection and a geometric reconstruction using axis-aligned cross-sections. Fine-tuned Vision Transformers (DINOv2 and DINOv3) with MLPs achieve the…
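To make the two conventional baselines named in the abstract concrete, here is a minimal sketch of how an area-based projection and an axis-aligned cross-section reconstruction might be computed from a binary spike mask. The abstract does not give the exact formulation, so several things here are assumptions: the spike axis runs along the image rows, each row is treated as a one-pixel-thick slice with a circular cross-section whose diameter is the foreground width, and the function names and the `mm_per_px` scale are hypothetical.

```python
import math

def projected_area(mask, mm_per_px):
    """Area baseline: foreground pixel count scaled to mm^2."""
    return sum(sum(row) for row in mask) * mm_per_px ** 2

def cross_section_volume(mask, mm_per_px):
    """Geometric baseline (assumed form): stack one-pixel-thick discs.

    Each image row is one slice along the spike axis. The slice is
    assumed circular, with diameter equal to the number of foreground
    pixels in that row. This circularity assumption is an illustration,
    not the formulation used in the paper.
    """
    volume_px3 = 0.0
    for row in mask:
        width = sum(row)                       # foreground pixels in this slice
        volume_px3 += math.pi * (width / 2) ** 2
    return volume_px3 * mm_per_px ** 3         # px^3 -> mm^3

# Toy mask: 1 = spike, 0 = background; rows have widths 2, 4, 2.
toy_mask = [
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 1, 0],
]

area_mm2 = projected_area(toy_mask, mm_per_px=0.5)
volume_mm3 = cross_section_volume(toy_mask, mm_per_px=0.5)
```

A sketch like this also shows why shape complexity matters for the geometric approach: any slice that is not close to circular, for example because of awns or protruding spikelets, breaks the assumption directly and biases the estimate.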
