SENSE: Stereo OpEN Vocabulary SEmantic Segmentation
Thomas Campagnolo (ACENTAURI), Ezio Malis (ACENTAURI), Philippe Martinet (ACENTAURI), Gaëtan Bahl

TL;DR
SENSE combines stereo vision with vision-language models to improve open-vocabulary semantic segmentation, achieving higher accuracy and better spatial reasoning, especially in complex scenes with occlusions.
Contribution
SENSE is presented as the first work to leverage stereo vision for open-vocabulary semantic segmentation, enhancing spatial reasoning and generalization in zero-shot scenarios.
Findings
+2.9% AP on PhraseStereo over baseline
+0.76% AP on PhraseStereo over the best competing method
+3.5% mIoU on Cityscapes, +18% on KITTI
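For readers unfamiliar with the mIoU metric quoted above, the following is a minimal sketch of how mean Intersection-over-Union is commonly computed from a predicted label map and a ground-truth label map. It is not the paper's evaluation code: the function name `mean_iou`, the `ignore_index=255` convention, and the choice to average only over classes that occur in the prediction or the ground truth are assumptions made for illustration.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes present in the prediction or the ground truth.

    Assumes every predicted label lies in [0, num_classes).
    """
    pred = np.asarray(pred).ravel().astype(np.int64)
    target = np.asarray(target).ravel().astype(np.int64)

    # Drop pixels whose ground-truth label is marked as "ignore".
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]

    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)

    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection

    # Average only over classes that appear at least once.
    present = union > 0
    return float((intersection[present] / union[present]).mean())

# Toy 2x2 example with two classes and one ignored pixel.
pred = [[0, 1], [1, 1]]
target = [[0, 1], [0, 255]]
score = mean_iou(pred, target, num_classes=2)
```

In the toy example each class has one correctly labelled pixel and a union of two pixels, so both per-class IoUs, and therefore the mean, equal 0.5.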
Abstract
Open-vocabulary semantic segmentation enables models to segment objects or image regions beyond fixed class sets, offering flexibility in dynamic environments. However, existing methods often rely on single-view images and struggle with spatial precision, especially under occlusions and near object boundaries. We propose SENSE, the first work on Stereo OpEN Vocabulary SEmantic Segmentation, which leverages stereo vision and vision-language models to enhance open-vocabulary semantic segmentation. By incorporating stereo image pairs, we introduce geometric cues that improve spatial reasoning and segmentation accuracy. Trained on the PhraseStereo dataset, our approach achieves strong performance in phrase-grounded tasks and demonstrates generalization in zero-shot settings. On PhraseStereo, we show a +2.9% improvement in Average Precision over the baseline method and +0.76% over the best…
