Enhancing long-range depth estimation via heterogeneous CNN-transformer encoding and cross-dimensional semantic fusion

Yunhao Chen; Qian Yin; Li Zhao; Jianlong Wang; Sida Zhou; Jianing Tang

PMC · DOI: 10.1038/s41598-026-36755-0 · February 17, 2026

Open Access

TL;DR

A new framework improves depth estimation in distant regions by combining CNNs and transformers with a novel fusion module.

Contribution

A novel monocular depth estimation framework that pairs a heterogeneous CNN-transformer encoder with a Cross-dimensional Semantic Fusion (CSF) module.

Findings

1. The framework achieves 0.050 Abs-Rel and 2.107 RMSE on the KITTI dataset.

2. It demonstrates strong generalization with 0.142 Abs-Rel on the SUN RGB-D dataset.

3. The CSF module enhances feature aggregation for distant objects with low pixel occupancy.
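
For readers unfamiliar with the two metrics quoted in the findings, the sketch below shows how Abs-Rel (mean absolute relative error) and RMSE (root mean squared error) are commonly computed for depth maps. It uses the standard textbook definitions; the function names, the toy depth values, and the `gt > 0` validity mask are illustrative assumptions, and the paper's exact evaluation protocol (depth cap, crop, masking) is not stated in this summary.

```python
import math

def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over pixels with valid ground truth."""
    # Assumption: ground-truth depth of 0 marks a pixel with no measurement.
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)

def rmse(pred, gt):
    """Square root of the mean squared depth error over valid pixels."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return math.sqrt(sum((p - g) ** 2 for p, g in pairs) / len(pairs))

# Toy flattened depth maps in metres (hypothetical values, not from the paper).
gt = [10.0, 20.0, 40.0, 0.0]    # last pixel has no ground truth
pred = [11.0, 18.0, 44.0, 5.0]

print(f"Abs-Rel: {abs_rel(pred, gt):.3f}")  # approximately 0.100
print(f"RMSE:    {rmse(pred, gt):.3f}")     # approximately 2.646
```

In this toy case every valid pixel has the same 10% relative error, so Abs-Rel is about 0.1, while RMSE is dominated by the 4 m error at the 40 m pixel. Because RMSE weights absolute error in this way, it tends to be sensitive to mistakes in distant regions, where depth values are largest.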

Abstract

Monocular depth estimation enables 3D scene reconstruction from a single 2D image, offering a cost-effective solution widely applied in autonomous driving and UAVs. However, existing deep neural networks often fail to balance local texture details with global contextual information, leading to significant inaccuracies in distant-region depth prediction. To address this challenge, we introduce a novel monocular depth estimation framework featuring a heterogeneous encoder and a Cross-dimensional Semantic Fusion (CSF) module. The heterogeneous encoder integrates the initial convolutional layers of ResNet-50 with the hierarchical attention mechanism of Swin Transformer to efficiently capture both local details and long-range dependencies. Specifically targeting the characteristics of distant objects—low pixel occupancy but high semantic relevance—the CSF module enhances feature aggregation…

Taxonomy

Topics: Advanced Vision and Imaging · Advanced Neural Network Applications · Robotics and Sensor-Based Localization