Efficient Segment Anything with Depth-Aware Fusion and Limited Training Data

Yiming Zhou; Xuenjie Xie; Panfeng Li; Albrecht Kunz; Ahmad Osman; Xavier Maldague

arXiv:2602.11804 · cs.CV · February 13, 2026

TL;DR

This paper introduces a lightweight RGB-D fusion approach that augments EfficientViT-SAM with monocular depth priors and is trained on only 11.2k samples, far fewer than the roughly 11M images used to train SAM.

Contribution

The authors develop a depth-aware fusion framework that adds a dedicated depth encoder to EfficientViT-SAM and fuses its output with RGB features at mid-level, improving segmentation accuracy while training on 11.2k samples rather than a large-scale dataset.

Findings

01 Outperforms EfficientViT-SAM in accuracy

02 Requires only 11.2k training samples

03 Utilizes monocular depth priors for geometric enhancement

Abstract

Segment Anything Models (SAM) achieve impressive universal segmentation performance but require massive datasets (e.g., 11M images) and rely solely on RGB inputs. Recent efficient variants reduce computation but still depend on large-scale training. We propose a lightweight RGB-D fusion framework that augments EfficientViT-SAM with monocular depth priors. Depth maps are generated with a pretrained estimator and fused mid-level with RGB features through a dedicated depth encoder. Trained on only 11.2k samples (less than 0.1% of SA-1B), our method achieves higher accuracy than EfficientViT-SAM, showing that depth cues provide strong geometric priors for segmentation.
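
The mid-level fusion step described in the abstract can be illustrated with a small NumPy sketch. The abstract does not say how the depth and RGB features are combined, so everything here is an assumption made for illustration: the additive fusion, the 1x1 projection, the zero initialisation, the tensor shapes, and the names `conv1x1`, `fuse_mid_level`, and `proj_weight` are hypothetical and are not taken from the paper.

```python
import numpy as np


def conv1x1(x, weight):
    """Apply a 1x1 convolution. x: (C_in, H, W), weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def fuse_mid_level(rgb_feat, depth_feat, proj_weight):
    """Project depth features to the RGB channel width, then add element-wise.

    Additive fusion is an assumption; the abstract only states that depth
    features are fused with RGB features at mid-level.
    """
    if rgb_feat.shape[1:] != depth_feat.shape[1:]:
        raise ValueError("RGB and depth features must share spatial size")
    return rgb_feat + conv1x1(depth_feat, proj_weight)


rng = np.random.default_rng(0)
rgb_feat = rng.standard_normal((64, 16, 16))    # hypothetical mid-level RGB features
depth_feat = rng.standard_normal((32, 16, 16))  # hypothetical depth-encoder output

# Zero-initialised projection (an assumed design choice): the fused features
# start out identical to the RGB features, so the RGB pathway is undisturbed
# before the depth branch has been trained.
proj_weight = np.zeros((64, 32))

fused = fuse_mid_level(rgb_feat, depth_feat, proj_weight)
print(fused.shape)
```

With a zero-initialised projection the fused tensor equals the RGB features; once the projection has non-zero weights, the depth branch contributes a residual on top of them. Concatenation followed by a projection would be an equally plausible reading of "fused mid-level".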

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Vision and Imaging · Medical Image Segmentation Techniques