EGSA-PT: Edge-Guided Spatial Attention with Progressive Training for Monocular Depth Estimation and Segmentation of Transparent Objects

Gbenga Omotara; Ramy Farag; Seyed Mohamad Ali Tousi; G.N. DeSouza

arXiv:2511.14970 · cs.CV · November 20, 2025

Open Access

TL;DR

This paper introduces EGSA, a boundary-aware fusion mechanism, and a progressive training strategy that together improve depth estimation and segmentation of transparent objects without requiring ground-truth depth during training.

Contribution

The paper proposes a novel edge-guided spatial attention fusion method and a multi-modal progressive training approach for better transparent object perception.

Findings

01

EGSA improves depth accuracy over the current state-of-the-art method (MODEST) while keeping segmentation performance competitive.

02

The progressive training strategy eliminates the need for ground-truth depth during training.

03

On both the Syn-TODD and ClearPose benchmarks, the largest improvements appear in transparent regions.

Abstract

Transparent object perception remains a major challenge in computer vision research, as transparency confounds both depth estimation and semantic segmentation. Recent work has explored multi-task learning frameworks to improve robustness, yet negative cross-task interactions often hinder performance. In this work, we introduce Edge-Guided Spatial Attention (EGSA), a fusion mechanism designed to mitigate destructive interactions by incorporating boundary information into the fusion between semantic and geometric features. On both the Syn-TODD and ClearPose benchmarks, EGSA consistently improved depth accuracy over the current state-of-the-art method (MODEST) while preserving competitive segmentation performance, with the largest improvements appearing in transparent regions. Beyond the fusion design, our second contribution is a multi-modal progressive training strategy, where learning…
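
The abstract describes EGSA only at a high level: boundary information guides the fusion of semantic and geometric features. The following NumPy sketch illustrates one way such an edge-guided spatial attention could be wired. It is not the authors' implementation. The function name `edge_guided_fusion`, the scalar weights `w_sem`, `w_geo`, `w_edge`, `bias`, the channel-mean pooling, and the residual form of the fusion are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def edge_guided_fusion(semantic, geometric, edge,
                       w_sem=1.0, w_geo=1.0, w_edge=1.0, bias=0.0):
    """Hypothetical edge-guided spatial attention fusion (illustrative only).

    semantic, geometric: feature maps of shape (C, H, W).
    edge: boundary map of shape (H, W), assumed to lie in [0, 1].
    Returns the fused features (C, H, W) and the attention map (H, W).
    """
    if semantic.shape != geometric.shape:
        raise ValueError("semantic and geometric features must share a shape")
    if edge.shape != semantic.shape[1:]:
        raise ValueError("edge map must match the spatial size of the features")

    # Collapse channels into one spatial descriptor per branch (assumed pooling).
    sem_pool = semantic.mean(axis=0)
    geo_pool = geometric.mean(axis=0)

    # Combine both descriptors with the edge map into per-pixel logits.
    logits = w_sem * sem_pool + w_geo * geo_pool + w_edge * edge + bias
    attention = sigmoid(logits)  # values in (0, 1), shape (H, W)

    # Inject semantic features into the geometric branch, gated per pixel.
    fused = geometric + attention[None, :, :] * semantic
    return fused, attention


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sem = rng.normal(size=(4, 8, 8))
    geo = rng.normal(size=(4, 8, 8))
    edge = np.zeros((8, 8))
    edge[3, :] = 1.0  # a single horizontal boundary
    fused, att = edge_guided_fusion(sem, geo, edge)
    print(fused.shape, att.shape)
```

In this sketch a positive `w_edge` raises the attention at boundary pixels, so more semantic information flows into the geometric branch there. A learned version would replace the scalar weights with convolutional layers; the paper itself should be consulted for the actual design.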

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Vision and Imaging · Visual Attention and Saliency Detection