Fusion Complexity Inversion: Why Simpler Cross View Modules Outperform SSMs and Cross View Attention Transformers for Pasture Biomass Regression
Mridankan Mandal

TL;DR
This study shows that, for agricultural biomass estimation with limited data, simpler cross-view fusion modules outperform complex attention-based models, and that backbone quality and local fusion modules matter more than fusion complexity.
Contribution
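The paper introduces the concept of "fusion complexity inversion" and systematically evaluates vision foundation models on the CSIRO Pasture Biomass benchmark across 17 configurations (four backbones, five cross-view fusion mechanisms, and a 4x2 metadata factorial), highlighting the importance of backbone pretraining.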
Findings
A two-layer gated depthwise convolution (R^2 = 0.903) outperforms cross-view attention transformers (0.833) and bidirectional SSMs (0.819).
Backbone pretraining scale has a dominant impact on performance.
Training with metadata caps the maximum achievable R^2 at approximately 0.829.
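The findings above are all stated in terms of R^2. As a reminder of what that number measures, here is a minimal sketch of the standard, unweighted coefficient of determination. The summary does not say which variant of R^2 the paper reports (for example, whether the five biomass targets are weighted), so the unweighted form and the function name `r2_score` are assumptions made for illustration.

```python
def r2_score(y_true, y_pred):
    """Standard (unweighted) coefficient of determination: 1 - SS_res / SS_tot.

    Assumption: the paper may use a weighted or per-target variant;
    this is only the textbook definition.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot
```

Under this definition, a model that always predicts the mean of the targets scores 0 and a perfect model scores 1, so the reported gap between 0.903 and 0.833 is a difference in the fraction of target variance explained.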
Abstract
Accurate estimation of pasture biomass from agricultural imagery is critical for sustainable livestock management, yet existing methods are limited by the small, imbalanced, and sparsely annotated datasets typical of real-world monitoring. In this study, the adaptation of vision foundation models to agricultural regression is systematically evaluated on the CSIRO Pasture Biomass benchmark, a 357-image dual-view dataset with laboratory-validated, component-wise ground truth for five biomass targets, through 17 configurations spanning four backbones (EfficientNet-B3 to DINOv3-ViT-L), five cross-view fusion mechanisms, and a 4x2 metadata factorial. A counterintuitive principle, termed "fusion complexity inversion", is uncovered: on scarce agricultural data, a two-layer gated depthwise convolution (R^2 = 0.903) outperforms cross-view attention transformers (0.833), bidirectional SSMs (0.819),…
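To make "two-layer gated depthwise convolution" concrete, the following is a rough, dependency-free sketch of one way such a cross-view fusion module could look. It is not the author's implementation: the abstract does not describe the module's internals, so the token concatenation, sigmoid gate, residual connection, mean pooling, and every function name here are assumptions chosen for illustration.

```python
import math


def depthwise_conv1d(x, kernels):
    """Convolve each channel with its own kernel along the token axis.

    x: list of T tokens, each a list of C floats.
    kernels: C kernels of the same odd length k (one per channel).
    Zero padding keeps the output length at T. As in most deep learning
    libraries, this is cross-correlation (the kernel is not flipped).
    """
    T, C = len(x), len(x[0])
    k = len(kernels[0])
    pad = k // 2
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            acc = 0.0
            for j in range(k):
                s = t + j - pad
                if 0 <= s < T:  # positions outside the sequence count as zero
                    acc += kernels[c][j] * x[s][c]
            out[t][c] = acc
    return out


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_dwconv_layer(x, value_kernels, gate_kernels):
    """Residual layer: x + sigmoid(dwconv_gate(x)) * dwconv_value(x)."""
    value = depthwise_conv1d(x, value_kernels)
    gate = depthwise_conv1d(x, gate_kernels)
    return [
        [x[t][c] + sigmoid(gate[t][c]) * value[t][c] for c in range(len(x[0]))]
        for t in range(len(x))
    ]


def fuse_views(left_tokens, right_tokens, layers):
    """Hypothetical fusion: concatenate both views' tokens, apply the gated
    depthwise layers in order, then mean-pool to a single feature vector.

    layers: list of (value_kernels, gate_kernels) pairs; two pairs give
    the two-layer variant.
    """
    tokens = left_tokens + right_tokens  # concatenate along the token axis
    for value_kernels, gate_kernels in layers:
        tokens = gated_dwconv_layer(tokens, value_kernels, gate_kernels)
    channels = len(tokens[0])
    return [sum(tok[c] for tok in tokens) / len(tokens) for c in range(channels)]
```

Each channel is mixed only with its own neighbours along the token axis, so a layer with C channels and kernel size k has on the order of 2·C·k weights, far fewer than a cross-view attention block of the same width. That small parameter count is one plausible reason a module like this could suit a 357-image dataset, although the truncated abstract does not state the author's own explanation.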
