Glass Segmentation with Fusion of Learned and General Visual Features
Risto Ojala, Tristan Ellison, Mo Chen

TL;DR
This paper introduces a novel glass segmentation architecture combining learned and general visual features, achieving state-of-the-art accuracy and efficiency on multiple datasets.
Contribution
The paper proposes a dual-backbone architecture utilizing a frozen DINOv3 model and a trained Swin model for improved glass segmentation.
Findings
Achieved state-of-the-art accuracy on four glass segmentation datasets.
Inference speed is competitive with previous methods, and surpasses them when a lighter backbone is used.
The approach effectively combines learned and general features for transparent object segmentation.
Abstract
Glass surface segmentation from RGB images is a challenging task, since glass, as a transparent material, lacks distinct visual characteristics. However, glass segmentation is critical for scene understanding and robotics, as transparent glass surfaces must be identified as solid material. This paper presents a novel architecture for glass segmentation, using a dual backbone that produces general visual features as well as task-specific learned visual features. General visual features are produced by a frozen DINOv3 vision foundation model, and the task-specific features are generated with a Swin model trained in a supervised manner. The resulting multi-scale feature representations are downsampled with residual Squeeze-and-Excitation Channel Reduction and fed into a Mask2Former Decoder, which produces the final segmentation masks. The architecture was evaluated on four commonly used glass…
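The fusion step described in the abstract can be illustrated with a small sketch. The abstract does not give the exact layer layout, so everything below is an assumption made for illustration: the two backbones' feature maps are concatenated along the channel axis, a Squeeze-and-Excitation gate reweights the channels, the gated result is added back to the input (the residual), and a 1x1 projection reduces the channel count. The function names (`se_gate`, `residual_se_reduce`, `fuse`), the weight shapes, and the random weights are hypothetical and not taken from the paper. Plain Python lists stand in for tensors so that the sketch has no framework dependency.

```python
import math
import random


def se_gate(feat, w1, w2):
    """Squeeze-and-Excitation: compute one sigmoid gate per channel.

    feat: list of C channels, each a flat list of H*W values.
    w1:   (C // r) x C weights, w2: C x (C // r) weights (assumed shapes).
    """
    # Squeeze: global average pooling over the spatial positions.
    squeezed = [sum(ch) / len(ch) for ch in feat]
    # Excitation: bottleneck layer with ReLU, then a sigmoid per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def residual_se_reduce(feat, w1, w2, proj):
    """Assumed layout: residual SE gating followed by a 1x1 projection."""
    gates = se_gate(feat, w1, w2)
    # Residual connection around the gating: x + gate * x.
    gated = [[v + g * v for v in ch] for ch, g in zip(feat, gates)]
    n_pix = len(feat[0])
    # A 1x1 convolution is a per-pixel linear map over channels.
    return [
        [sum(w * gated[c][p] for c, w in enumerate(row)) for p in range(n_pix)]
        for row in proj
    ]


def fuse(general, learned, w1, w2, proj):
    """Concatenate general and learned features on the channel axis, then reduce."""
    return residual_se_reduce(general + learned, w1, w2, proj)


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy inputs: 4 channels per backbone on a 2x3 grid (6 spatial positions).
general = rand_matrix(4, 6)   # stand-in for frozen foundation-model features
learned = rand_matrix(4, 6)   # stand-in for trained Swin features
w1 = rand_matrix(2, 8)        # SE bottleneck: 8 -> 2
w2 = rand_matrix(8, 2)        # SE expansion: 2 -> 8
proj = rand_matrix(3, 8)      # channel reduction: 8 -> 3

fused = fuse(general, learned, w1, w2, proj)
print(len(fused), len(fused[0]))  # 3 6
```

In a multi-scale setting, a reduction of this kind would be applied once per feature level before the decoder. In a real model the weights would be learned and a tensor library would replace the nested lists.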
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Neural Network Applications · Industrial Vision Systems and Defect Detection
