Fine-Tuned Vision Transformers Capture Complex Wheat Spike Morphology for Volume Estimation from RGB Images

Olivia Zumsteg; Nico Graf; Aaron Haeusler; Norbert Kirchgessner; Nicola Storni; Lukas Roth; Andreas Hund

arXiv:2506.18060 · cs.CV · December 30, 2025

TL;DR

This study develops a deep learning-based method using fine-tuned Vision Transformers to accurately estimate wheat spike volume from RGB images, outperforming traditional geometric and CNN-based approaches in field conditions.

Contribution

The paper introduces a pipeline that uses fine-tuned Vision Transformers for non-destructive estimation of wheat spike volume from RGB images, with the aim of high accuracy and robustness.

Findings

01. Fine-tuned Vision Transformers outperform CNNs and geometric baselines.

02. Deep-supervised LSTMs excel with frozen DINO backbones.

03. Object shape complexity affects geometric method accuracy.

Abstract

Estimating three-dimensional morphological traits such as volume from two-dimensional RGB images presents inherent challenges due to the loss of depth information, projection distortions, and occlusions under field conditions. In this work, we explore multiple approaches for non-destructive volume estimation of wheat spikes using RGB images and structured-light 3D scans as ground truth references. Wheat spike volume is promising for phenotyping as it shows high correlation with spike dry weight, a key component of fruiting efficiency. Accounting for the complex geometry of the spikes, we compare different neural network approaches for volume estimation from 2D images and benchmark them against two conventional baselines: a 2D area-based projection and a geometric reconstruction using axis-aligned cross-sections. Fine-tuned Vision Transformers (DINOv2 and DINOv3) with MLPs achieve the…
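
The two conventional baselines named in the abstract, a 2D area-based projection and a reconstruction from axis-aligned cross-sections, can be pictured with a minimal sketch. Everything below is an assumption for illustration and not the authors' implementation: the input is a binary segmentation mask given as nested lists, the pixel size is known, the spike's long axis runs vertically through the image, and each cross-section is treated as a circle whose diameter equals the mask extent in that row. The function names `projected_area` and `cross_section_volume` are hypothetical.

```python
import math


def projected_area(mask, pixel_size_mm):
    """Area of the foreground pixels in mm^2 (the 2D projection feature).

    Assumption: `mask` is a list of rows; truthy values mark spike pixels.
    """
    n_pixels = sum(1 for row in mask for value in row if value)
    return n_pixels * pixel_size_mm ** 2


def cross_section_volume(mask, pixel_size_mm):
    """Volume in mm^3 from stacking one circular disc per image row.

    Assumptions (not from the paper): the long axis is vertical, and each
    cross-section is a circle whose diameter is the mask extent in that row.
    """
    volume = 0.0
    for row in mask:
        cols = [i for i, value in enumerate(row) if value]
        if not cols:
            continue  # no spike pixels in this row
        width_mm = (max(cols) - min(cols) + 1) * pixel_size_mm
        radius_mm = width_mm / 2.0
        # Disc volume: circle area times the row thickness (one pixel).
        volume += math.pi * radius_mm ** 2 * pixel_size_mm
    return volume


if __name__ == "__main__":
    # Toy mask: a 4-pixel-wide, 10-pixel-tall rectangle, i.e. the silhouette
    # of a cylinder with diameter 4 and height 10 (in pixel units).
    toy_mask = [[0, 1, 1, 1, 1, 0] for _ in range(10)]
    print("area:", projected_area(toy_mask, 1.0))
    print("volume:", cross_section_volume(toy_mask, 1.0))
```

For the toy cylinder the disc stacking recovers the analytic volume (pi times radius squared times height). A real spike has awns, gaps between spikelets, and non-circular cross-sections, so the row extent overstates the true cross-section in many rows, which is one plausible reading of why shape complexity limits such geometric estimates.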
