# Enhancing long-range depth estimation via heterogeneous CNN-transformer encoding and cross-dimensional semantic fusion

**Authors:** Yunhao Chen, Qian Yin, Li Zhao, Jianlong Wang, Sida Zhou, Jianing Tang

PMC · DOI: 10.1038/s41598-026-36755-0 · 2026-02-17

## TL;DR

A monocular depth estimation framework pairs a heterogeneous CNN-transformer encoder (ResNet-50 plus Swin Transformer) with a Cross-dimensional Semantic Fusion (CSF) module to improve depth accuracy in distant regions.

## Contribution

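A monocular depth estimation framework whose heterogeneous encoder integrates the initial convolutional layers of ResNet-50 with the hierarchical attention of Swin Transformer, and whose decoder combines a Cross-dimensional Semantic Fusion (CSF) module with a Depth-Separable Upsampling Block (DSUB) and a Multi-scale Self-Attention (MSA) module.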

## Key findings

- The framework achieves 0.050 Abs-Rel and 2.107 RMSE on the KITTI dataset.
- It demonstrates strong generalization with 0.142 Abs-Rel on the SUN RGB-D dataset.
- The CSF module enhances feature aggregation for distant objects with low pixel occupancy.

## Abstract

Monocular depth estimation enables 3D scene reconstruction from a single 2D image, offering a cost-effective solution widely applied in autonomous driving and UAVs. However, existing deep neural networks often fail to balance local texture details with global contextual information, leading to significant inaccuracies in distant-region depth prediction. To address this challenge, we introduce a novel monocular depth estimation framework featuring a heterogeneous encoder and a Cross-dimensional Semantic Fusion (CSF) module. The heterogeneous encoder integrates the initial convolutional layers of ResNet-50 with the hierarchical attention mechanism of Swin Transformer to efficiently capture both local details and long-range dependencies. Specifically targeting the characteristics of distant objects—low pixel occupancy but high semantic relevance—the CSF module enhances feature aggregation in the decoder through multi-scale interactions and spatial-channel coupling. Additionally, the decoder incorporates a Depth-Separable Upsampling Block (DSUB) and a Multi-scale Self-Attention (MSA) module to refine detail restoration and ensure spatial consistency. Experiments validate the superiority of our method. On the KITTI dataset, it achieves leading results: 0.050 Abs-Rel, 2.107 RMSE, and a long-range error of 0.2725. The SUN RGB-D dataset demonstrates strong generalization with an Abs-Rel of 0.142. This framework significantly advances long-range depth estimation research and shows broad application prospects.

The online version contains supplementary material available at 10.1038/s41598-026-36755-0.
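
The abstract names a Depth-Separable Upsampling Block (DSUB) but does not describe its internals. Assuming the name refers to depthwise separable convolution, the sketch below shows why that factorization is attractive in a decoder: it compares the weight counts of a standard convolution and its depthwise separable counterpart. The channel sizes and kernel size are hypothetical, and bias terms are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical decoder stage: 256 input channels, 128 output channels, 3 x 3 kernel.
c_in, c_out, k = 256, 128, 3
standard = standard_conv_params(c_in, c_out, k)
separable = depthwise_separable_params(c_in, c_out, k)

print(f"standard:  {standard} weights")
print(f"separable: {separable} weights")
print(f"reduction: {standard / separable:.2f}x")
```

For these illustrative sizes the separable form needs roughly an eighth of the weights. Whether the paper's DSUB follows exactly this layout is not stated in the abstract.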

## Figures

13 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13002885/full.md

---
Source: https://tomesphere.com/paper/PMC13002885