LUCAS-MEGA: A Large-Scale Multimodal Dataset for Representation Learning in Soil-Environment Systems

Kuangdai Leng, Simon Jeffery, Panos Panagos, and Tarje Nissen-Meyer

arXiv:2605.04323·cs.LG·May 11, 2026

TL;DR

LUCAS-MEGA is a large-scale multimodal soil dataset (over 70,000 samples and more than 1,000 features fused from 68 source datasets) built with the SoilFuser data fusion pipeline to support representation learning and predictive modeling in soil-environment research.

Contribution

The paper introduces LUCAS-MEGA, a large-scale multimodal soil dataset, and SoilFuser, the multi-agent, human-in-the-loop data fusion pipeline developed to integrate its sources. Together they support high-dimensional representation learning and modeling.

Findings

01  Pretrained SoilFormer achieves strong predictive performance.

02  Representations recover known soil process relationships.

03  Dataset supports uncertainty-aware predictions.

Abstract

Understanding soil is fundamental to agriculture, carbon cycling, and environmental sustainability, yet progress is limited by fragmented and heterogeneous datasets that constrain modeling to small-scale predictive settings rather than high-dimensional representation learning. We introduce LUCAS-MEGA, a large-scale multimodal dataset constructed through systematic data fusion of European soil-environment observations, with the LUCAS survey as its backbone. The fused dataset comprises over 70,000 samples and more than 1,000 features spanning physical, chemical, environmental, biological, and visual attributes, aggregated from 68 source datasets. To enable integration at scale, we develop SoilFuser, a multi-agent, human-in-the-loop data fusion pipeline that standardizes heterogeneous data formats and measurement protocols, resolves inconsistencies and invalid entries (e.g., unit…
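The abstract mentions standardizing measurement protocols and resolving invalid entries, with units given as one example. The following is a minimal sketch of what one such harmonization step might look like; it is not SoilFuser itself. The record layout, the `harmonize` function, the status labels, and the choice of g/kg as the target unit are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical conversion factors to a common target unit (g/kg).
# These are standard mass-fraction conversions, not values from the paper.
TO_G_PER_KG = {"g/kg": 1.0, "%": 10.0, "mg/kg": 0.001}


def harmonize(record):
    """Return (value_in_g_per_kg, status) for one raw measurement record."""
    value, unit = record.get("value"), record.get("unit")
    # Treat absent values and NaN as missing rather than guessing.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None, "missing"
    # Unrecognized unit labels are flagged for review, not converted.
    if unit not in TO_G_PER_KG:
        return None, "unknown_unit"
    converted = value * TO_G_PER_KG[unit]
    # A mass fraction cannot be negative or exceed 1000 g/kg.
    if not 0.0 <= converted <= 1000.0:
        return None, "out_of_range"
    return converted, "ok"


# Illustrative records only; these are not entries from LUCAS-MEGA.
records = [
    {"value": 2.5, "unit": "%"},      # 2.5 % by mass
    {"value": 18.0, "unit": "g/kg"},  # already in the target unit
    {"value": -1.0, "unit": "g/kg"},  # physically impossible
    {"value": 4.0, "unit": "ppm?"},   # unrecognized unit label
]
results = [harmonize(r) for r in records]
```

Returning a status label alongside each value, instead of silently dropping bad rows, keeps flagged entries available for a human reviewer, which is one way a human-in-the-loop step could be supported.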

Code & Models

Models

earthroverprogram/soilformer (model)

Datasets

earthroverprogram/lucas-mega (dataset, 349 downloads)
