# Lightweight Multimodal Fusion for Urban Tree Health and Ecosystem Services

**Authors:** Abror Buriboev, Djamshid Sultanov, Ilhom Rahmatullaev, Ozod Yusupov, Erali Eshonqulov, Dilshod Bekmuradov, Nodir Egamberdiev, Andrew Jaeyong Choi

PMC · DOI: 10.3390/s26010007 · Sensors (Basel, Switzerland) · 2025-12-19

## TL;DR

This paper introduces a lightweight AI system that combines camera images and sensor data to assess urban tree health and estimate ecosystem services, namely daily oxygen production and CO2 absorption.

## Contribution

A novel lightweight multimodal deep-learning framework that uses RGB imagery and sensor data for real-time tree-health classification and ecosystem-service estimation.
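
To make the fusion idea concrete, here is a minimal pure-Python sketch of the data flow, not the authors' implementation. It assumes late fusion: a precomputed image embedding (standing in for the EfficientNet-B0 output) is concatenated with an MLP-encoded sensor vector, passed through one shared layer, and fed to three heads (health class, O2, CO2). All sizes, the number of health classes, the concatenation strategy, and the loss (cross-entropy plus weighted absolute errors) are hypothetical choices, since this summary does not specify them.

```python
import math
import random

# Hypothetical toy sizes; the summary does not give the paper's real dimensions.
IMG_DIM = 8       # stand-in for a pooled EfficientNet-B0 image embedding
SENSOR_DIM = 4    # assumed number of environmental/biometric sensor features
HIDDEN = 6        # assumed width of the sensor MLP and the shared layer
NUM_CLASSES = 3   # assumed number of tree-health classes


def init_dense(n_in, n_out, rng):
    """Random weight matrix (n_in x n_out) and zero bias vector."""
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_in)]
    return w, [0.0] * n_out


def dense(x, w, b):
    """Fully connected layer: y_j = sum_i x_i * w[i][j] + b_j."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]


def relu(v):
    return [max(0.0, a) for a in v]


def log_softmax(z):
    """Numerically stable log-softmax over class logits."""
    m = max(z)
    lse = m + math.log(sum(math.exp(a - m) for a in z))
    return [a - lse for a in z]


class FusionModel:
    """Sensor MLP + concatenation fusion + shared layer + three task heads."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.sensor_mlp = init_dense(SENSOR_DIM, HIDDEN, rng)
        self.shared = init_dense(IMG_DIM + HIDDEN, HIDDEN, rng)
        self.health_head = init_dense(HIDDEN, NUM_CLASSES, rng)
        self.o2_head = init_dense(HIDDEN, 1, rng)
        self.co2_head = init_dense(HIDDEN, 1, rng)

    def forward(self, img_emb, sensors):
        s = relu(dense(sensors, *self.sensor_mlp))   # encode the sensor vector
        fused = list(img_emb) + s                    # concatenate the two modalities
        h = relu(dense(fused, *self.shared))         # shared multimodal representation
        logits = dense(h, *self.health_head)         # task 1: health-class logits
        o2 = dense(h, *self.o2_head)[0]              # task 2: daily O2 estimate
        co2 = dense(h, *self.co2_head)[0]            # task 3: daily CO2 estimate
        return logits, o2, co2


def joint_loss(logits, o2, co2, label, o2_true, co2_true, w_o2=1.0, w_co2=1.0):
    """Cross-entropy for the class plus weighted absolute errors (assumed weighting)."""
    ce = -log_softmax(logits)[label]
    return ce + w_o2 * abs(o2 - o2_true) + w_co2 * abs(co2 - co2_true)


# Usage with made-up inputs (not data from the paper).
model = FusionModel(seed=0)
logits, o2, co2 = model.forward([0.1] * IMG_DIM, [0.5, 0.2, 0.9, 0.3])
loss = joint_loss(logits, o2, co2, label=0, o2_true=1.0, co2_true=1.5)
```

Training the weights and replacing the fixed embedding with a real image backbone are deliberately left out; the sketch only shows how two modalities can feed one shared vector that serves a classifier and two regressors at once.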

## Key findings

- The model achieves 92.03% accuracy in tree-health classification.
- It lowers the regression error for oxygen (MAE = 1.28) and CO2 (MAE = 1.70) estimation relative to unimodal and multimodal baselines.
- With 5.4 million parameters and a 38 ms inference latency, the system is suited to edge deployment.
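
The two headline metrics are standard: classification accuracy is the share of correctly predicted labels, and MAE is the mean absolute difference between predicted and measured values. A short sketch follows; the class names and numbers are hypothetical toy data, and the paper's O2/CO2 units are not stated in this summary.

```python
def _check(a, b):
    """Both inputs must be non-empty and of equal length."""
    if len(a) == 0 or len(a) != len(b):
        raise ValueError("expected two non-empty sequences of equal length")


def accuracy(y_true, y_pred):
    """Share of samples whose predicted health class matches the label."""
    _check(y_true, y_pred)
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error, the metric quoted for the O2 and CO2 estimates."""
    _check(y_true, y_pred)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data, not from the paper.
health_true = ["healthy", "stressed", "healthy", "dead"]
health_pred = ["healthy", "healthy", "healthy", "dead"]
o2_true = [10.0, 12.0, 8.0, 9.0]
o2_pred = [11.0, 11.5, 8.0, 10.5]

print(f"accuracy = {accuracy(health_true, health_pred):.2%}")  # accuracy = 75.00%
print(f"O2 MAE   = {mae(o2_true, o2_pred):.2f}")               # O2 MAE   = 0.75
```

Read this way, 92.03% accuracy means roughly 92 of every 100 test trees receive the correct health label, and an O2 MAE of 1.28 means predictions are off by 1.28 units on average.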

## Abstract

Rapid urban expansion has heightened the demand for accurate, scalable, and real-time methods to assess tree health and the provision of ecosystem services. Urban trees are major contributors to air-quality improvement and climate-change mitigation; however, their monitoring still relies mostly on manual inspections, which are inherently subjective and inefficient. To overcome this limitation, we propose a lightweight multimodal deep-learning framework that fuses RGB imagery with environmental and biometric sensor data to jointly evaluate tree-health condition and estimate daily oxygen production and CO2 absorption. The proposed architecture features an EfficientNet-B0 vision encoder enhanced with Mobile Inverted Bottleneck Convolutions (MBConv) and a squeeze-and-excitation attention mechanism, together with a small multilayer perceptron for sensor processing. A shared multimodal representation supports a three-task learning set-up, enabling simultaneous classification and regression within a single model. Experiments on a carefully curated dataset of segmented tree images with synchronized sensor measurements show that our method attains a health-classification accuracy of 92.03% while also lowering the regression error for O2 (MAE = 1.28) and CO2 (MAE = 1.70) compared with unimodal and multimodal baselines. With 5.4 million parameters and an inference latency of 38 ms, the proposed architecture can be readily deployed on edge devices and real-time monitoring platforms.
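
The abstract names a squeeze-and-excitation (SE) attention mechanism in the vision encoder. The sketch below shows the standard SE operation in plain Python: squeeze each channel by global average pooling, excite through a two-layer bottleneck ending in a sigmoid, then rescale the channels by the resulting gates. It is a generic illustration, not the authors' code; the reduction ratio, the omission of biases, the ReLU in the bottleneck (many EfficientNet implementations use SiLU there), and the toy weights are all assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def squeeze_excite(fmap, w1, w2):
    """Standard SE channel attention on a C x H x W feature map (nested lists).

    w1: (C // r) rows of length C  -> reduction layer
    w2: C rows of length C // r    -> expansion layer
    Biases are omitted for brevity.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]
    # Excitation: bottleneck FC + ReLU, then FC + sigmoid -> gates in (0, 1).
    hidden = [max(0.0, sum(w * zc for w, zc in zip(row, z))) for row in w1]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    # Scale: multiply every activation in channel c by gate c.
    out = [[[v * g for v in row] for row in ch] for ch, g in zip(fmap, gates)]
    return out, gates


# Toy example with C = 2 channels, reduction ratio r = 2, 2 x 2 spatial grid.
fmap = [[[1.0, 1.0], [1.0, 1.0]],
        [[3.0, 3.0], [3.0, 3.0]]]
w1 = [[0.5, 0.5]]        # 1 x 2: averages the two channel descriptors
w2 = [[1.0], [-1.0]]     # 2 x 1: boosts channel 0, suppresses channel 1
out, gates = squeeze_excite(fmap, w1, w2)
print([round(g, 3) for g in gates])  # [0.881, 0.119]
```

Because the gates depend on global statistics of the whole feature map, SE lets the encoder emphasize informative channels at very small parameter cost, which fits the lightweight design goal described above.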

## Full-text entities

- **Chemicals:** O2 (MESH:D010100), CO2 (MESH:D002245)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12787558/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12787558/full.md

## References

26 references — full list in the complete paper: https://tomesphere.com/paper/PMC12787558/full.md

---
Source: https://tomesphere.com/paper/PMC12787558