TL;DR
SPROUT is a scalable diffusion-based foundation model for agricultural vision tasks. Pre-trained on 2.6 million agricultural images, it outperforms existing web-pretrained and agricultural foundation models at lower pre-training cost.
Contribution
Introduces SPROUT, a VAE-free pixel-space diffusion transformer for agriculture, pre-trained on 2.6 million images and outperforming prior web-pretrained and agricultural foundation models.
Findings
SPROUT outperforms state-of-the-art models on various agricultural tasks.
Its pre-training cost is significantly lower than that of existing models.
The model effectively learns structure-aware representations through diffusion denoising.
Abstract
Vision Foundation Models (VFMs) pre-trained on large-scale unlabeled data have achieved remarkable success on general computer vision tasks, yet typically suffer from significant domain gaps when applied to agriculture. In this context, we introduce SPROUT (Scalable Plant Representation model via Open-field Unsupervised Training), a multi-crop, multi-task agricultural foundation model trained via diffusion denoising. SPROUT leverages a VAE-free Pixel-space Diffusion Transformer to learn rich, structure-aware representations through denoising, enabling efficient end-to-end training. We pre-train SPROUT on a curated dataset of 2.6 million high-quality agricultural images spanning diverse crops, growth stages, and environments. Extensive experiments demonstrate that SPROUT consistently outperforms state-of-the-art web-pretrained and agricultural foundation models across a…
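To make "diffusion denoising in pixel space" concrete, here is a minimal sketch of a generic noise-prediction objective applied directly to image pixels, with no VAE encoder in between. It is not SPROUT's implementation: the cosine schedule, the noise-prediction target, the function names (`cosine_alpha_bar`, `add_noise`, `denoising_loss`), and the random stand-in images are all assumptions for illustration, since the abstract does not specify these details.

```python
import numpy as np

def cosine_alpha_bar(t: np.ndarray) -> np.ndarray:
    """Fraction of clean signal kept at time t in [0, 1] (assumed cosine schedule)."""
    return np.cos(0.5 * np.pi * t) ** 2

def add_noise(x0: np.ndarray, t: np.ndarray, rng: np.random.Generator):
    """Forward process on raw pixels: blend each clean image with Gaussian noise."""
    a = cosine_alpha_bar(t).reshape(-1, 1, 1, 1)  # one noise level per image
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

def denoising_loss(eps_pred: np.ndarray, eps: np.ndarray) -> float:
    """Mean squared error between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)
# Stand-in batch: 4 RGB images of 32x32 pixels scaled to [-1, 1] (hypothetical data).
x0 = rng.uniform(-1.0, 1.0, size=(4, 3, 32, 32))
t = rng.uniform(0.0, 1.0, size=4)
x_t, eps = add_noise(x0, t, rng)

# Placeholder for the network: a real model would map (x_t, t) to a noise estimate.
eps_pred = np.zeros_like(x_t)
loss = denoising_loss(eps_pred, eps)
```

In a real training loop, `eps_pred` would come from the transformer and `loss` would be minimized by gradient descent. Because the input is the noisy image itself rather than a latent code, the whole pipeline can be trained end to end, which is the property the abstract highlights.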
