LUCAS-MEGA: A Large-Scale Multimodal Dataset for Representation Learning in Soil-Environment Systems
Kuangdai Leng, Simon Jeffery, Panos Panagos, and Tarje Nissen-Meyer

TL;DR
LUCAS-MEGA is a large-scale multimodal soil dataset for Europe, built with the SoilFuser data fusion pipeline to support representation learning and predictive modeling in soil-environment research.
Contribution
The paper introduces LUCAS-MEGA, a large-scale multimodal soil dataset, and SoilFuser, a multi-agent, human-in-the-loop data fusion pipeline. Together they support high-dimensional representation learning and modeling.
Findings
Pretrained SoilFormer achieves strong predictive performance.
Representations recover known soil process relationships.
Dataset supports uncertainty-aware predictions.
Abstract
Understanding soil is fundamental to agriculture, carbon cycling, and environmental sustainability, yet progress is limited by fragmented and heterogeneous datasets that constrain modeling to small-scale predictive settings rather than high-dimensional representation learning. We introduce LUCAS-MEGA, a large-scale multimodal dataset constructed through systematic data fusion of European soil-environment observations, with the LUCAS survey as its backbone. The fused dataset comprises over 70,000 samples and more than 1,000 features spanning physical, chemical, environmental, biological, and visual attributes, aggregated from 68 source datasets. To enable integration at scale, we develop SoilFuser, a multi-agent, human-in-the-loop data fusion pipeline that standardizes heterogeneous data formats and measurement protocols, resolves inconsistencies and invalid entries (e.g., unit…
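The fusion idea described in the abstract (source tables standardized, then attached to a survey backbone) can be illustrated with a minimal sketch. This is not SoilFuser: the key `point_id`, the column names, the `-999` sentinel, and the helpers `standardize` and `fuse` are all hypothetical, chosen only to show a left join onto a backbone with invalid entries turned into missing values.

```python
# Hypothetical backbone: one record per survey point, keyed by an assumed "point_id".
backbone = [
    {"point_id": 1, "ph": 6.5},
    {"point_id": 2, "ph": 7.1},
    {"point_id": 3, "ph": 5.8},
]

# Hypothetical source table with its own column names and an assumed -999
# sentinel marking invalid entries.
climate_source = [
    {"ID": 1, "MeanTemp": 9.4},
    {"ID": 2, "MeanTemp": -999},
    {"ID": 4, "MeanTemp": 11.0},  # has no matching backbone point
]


def standardize(rows, rename, sentinel=-999):
    """Rename columns to a shared schema and map sentinel values to None."""
    out = []
    for row in rows:
        clean = {}
        for key, value in row.items():
            clean[rename.get(key, key)] = None if value == sentinel else value
        out.append(clean)
    return out


def fuse(backbone_rows, source_rows, key="point_id"):
    """Left-join source features onto the backbone; unmatched points get None."""
    lookup = {row[key]: row for row in source_rows}
    feature_names = sorted({k for row in source_rows for k in row if k != key})
    fused_rows = []
    for base in backbone_rows:
        merged = dict(base)
        match = lookup.get(base[key], {})
        for name in feature_names:
            merged[name] = match.get(name)
        fused_rows.append(merged)
    return fused_rows


climate = standardize(climate_source, {"ID": "point_id", "MeanTemp": "mean_temp_c"})
fused = fuse(backbone, climate)
for row in fused:
    print(row)
```

The join is deliberately a left join: the backbone decides which samples exist, so a source record with no backbone match is dropped, and a backbone point with no source match keeps the feature as a missing value instead of being discarded.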
