Efficient Prediction of Dense Visual Embeddings via Distillation and RGB-D Transformers
Söhnke Benedikt Fischedick, Daniel Seichter, Benedict Stephan, Robin Schmidt, Horst-Michael Gross

TL;DR
This paper introduces DVEFormer, an efficient RGB-D Transformer model that predicts dense visual embeddings for robotics, enabling real-time semantic understanding and flexible text-based querying in indoor environments.
Contribution
The paper presents DVEFormer, a novel knowledge distillation approach that produces dense visual embeddings for semantic segmentation and natural language querying, optimized for real-time robotic applications.
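The distillation idea can be illustrated with a minimal sketch: a student's per-pixel embeddings are pulled toward a teacher's embeddings for the same pixels. The cosine-distance objective, the function names, and the toy vectors below are assumptions made for illustration only; this summary does not state which loss the paper actually uses, and plain Python lists stand in for real tensors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Small epsilon guards against division by zero for all-zero vectors.
    return dot / (norm_a * norm_b + 1e-12)


def distillation_loss(student_pixels, teacher_pixels):
    """Mean cosine distance (1 - similarity) over corresponding pixels.

    Hypothetical objective: each argument is a list of per-pixel embedding
    vectors, with student and teacher aligned pixel by pixel.
    """
    distances = [
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_pixels, teacher_pixels)
    ]
    return sum(distances) / len(distances)


# Toy data: two pixels with 2-dimensional embeddings.
student = [[1.0, 0.0], [0.0, 2.0]]
teacher = [[2.0, 0.0], [1.0, 0.0]]

# First pixel points the same way as its teacher (distance 0),
# second pixel is orthogonal to its teacher (distance 1).
loss = distillation_loss(student, teacher)
print(round(loss, 6))
```

Because cosine distance ignores vector length, the first pixel incurs no penalty even though the student and teacher vectors differ in magnitude; only the direction mismatch of the second pixel contributes.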
Findings
Achieves 26.3 FPS with the full model and 77.0 FPS with a smaller variant.
Demonstrates competitive performance on indoor datasets.
Enables flexible text-based querying and 3D mapping integration.
Abstract
In domestic environments, robots require a comprehensive understanding of their surroundings to interact effectively and intuitively with untrained humans. In this paper, we propose DVEFormer - an efficient RGB-D Transformer-based approach that predicts dense text-aligned visual embeddings (DVE) via knowledge distillation. Instead of directly performing classical semantic segmentation with fixed predefined classes, our method uses teacher embeddings from Alpha-CLIP to guide our efficient student model DVEFormer in learning fine-grained pixel-wise embeddings. While this approach still enables classical semantic segmentation, e.g., via linear probing, it further enables flexible text-based querying and other applications, such as creating comprehensive 3D maps. Evaluations on common indoor datasets demonstrate that our approach achieves competitive performance while meeting real-time…
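The text-based querying mentioned in the abstract can be sketched as a nearest-text lookup: each pixel embedding is compared with a set of text embeddings, and the pixel takes the label of the most similar one. This is an illustrative sketch, not the authors' implementation. The function names, the three-dimensional toy vectors, and the label set are hypothetical; in practice the text embeddings would come from a text encoder aligned with the teacher model.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def query_embedding_map(embedding_map, text_embeddings):
    """Label each pixel with the text whose embedding is most similar.

    embedding_map: H x W grid of per-pixel embedding vectors (nested lists).
    text_embeddings: dict mapping a label string to its embedding vector.
    Similarity is the dot product of unit vectors, i.e. cosine similarity.
    """
    texts = {label: normalize(vec) for label, vec in text_embeddings.items()}
    label_map = []
    for row in embedding_map:
        label_row = []
        for pixel in row:
            unit_pixel = normalize(pixel)
            scores = {
                label: sum(a * b for a, b in zip(unit_pixel, text_vec))
                for label, text_vec in texts.items()
            }
            label_row.append(max(scores, key=scores.get))
        label_map.append(label_row)
    return label_map


# Toy 2 x 2 embedding map with 3-dimensional embeddings (hypothetical values).
embedding_map = [
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]],
    [[0.0, 0.2, 0.9], [0.7, 0.3, 0.0]],
]
# Hypothetical text embeddings; a real system would obtain these from a text encoder.
text_embeddings = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "floor": [0.0, 0.0, 1.0],
}

labels = query_embedding_map(embedding_map, text_embeddings)
print(labels)
```

Changing the query set only requires new text embeddings, not retraining, which is what distinguishes this kind of querying from segmentation with fixed predefined classes.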
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Robotics and Sensor-Based Localization
