Efficient Prediction of Dense Visual Embeddings via Distillation and RGB-D Transformers

Söhnke Benedikt Fischedick; Daniel Seichter; Benedict Stephan; Robin Schmidt; Horst-Michael Gross

arXiv:2601.00359·cs.CV·January 5, 2026

TL;DR

This paper introduces DVEFormer, an efficient RGB-D Transformer model that predicts dense visual embeddings for robotics, enabling real-time semantic understanding and flexible text-based querying in indoor environments.

Contribution

The paper presents DVEFormer, an efficient RGB-D Transformer trained via knowledge distillation from Alpha-CLIP teacher embeddings. It predicts dense, text-aligned visual embeddings that support both semantic segmentation and natural language querying, and it is designed for real-time robotic applications.

Findings

01

The full model runs at 26.3 FPS; a smaller variant runs at 77.0 FPS.

02

Achieves competitive performance on common indoor datasets.

03

Enables flexible text-based querying and 3D mapping integration.

Abstract

In domestic environments, robots require a comprehensive understanding of their surroundings to interact effectively and intuitively with untrained humans. In this paper, we propose DVEFormer - an efficient RGB-D Transformer-based approach that predicts dense text-aligned visual embeddings (DVE) via knowledge distillation. Instead of directly performing classical semantic segmentation with fixed predefined classes, our method uses teacher embeddings from Alpha-CLIP to guide our efficient student model DVEFormer in learning fine-grained pixel-wise embeddings. While this approach still enables classical semantic segmentation, e.g., via linear probing, it further enables flexible text-based querying and other applications, such as creating comprehensive 3D maps. Evaluations on common indoor datasets demonstrate that our approach achieves competitive performance while meeting real-time…
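To make the text-based querying idea concrete, here is a minimal sketch in plain Python. It assumes that each pixel already carries a text-aligned embedding and that each query string has been encoded into the same embedding space; a pixel is then assigned the query with the highest cosine similarity. The function names (`l2_normalize`, `cosine_similarity`, `query_segmentation`) and the toy 3-dimensional vectors are hypothetical illustrations, not taken from the paper, and the authors' actual pipeline may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def query_segmentation(pixel_embeddings, text_embeddings):
    """Assign each pixel the label whose text embedding is most similar.

    pixel_embeddings: H x W grid (nested lists) of D-dimensional vectors.
    text_embeddings: dict mapping a label string to a D-dimensional vector.
    """
    labels = list(text_embeddings)
    result = []
    for row in pixel_embeddings:
        out_row = []
        for emb in row:
            scores = [cosine_similarity(emb, text_embeddings[label]) for label in labels]
            out_row.append(labels[scores.index(max(scores))])
        result.append(out_row)
    return result


# Toy data (hypothetical): a 2 x 2 "image" with 3-dimensional embeddings.
# In practice the text vectors would come from a text encoder and the
# pixel vectors from the dense embedding model.
text_embeddings = {"chair": [1.0, 0.0, 0.0], "table": [0.0, 1.0, 0.0]}
pixel_embeddings = [
    [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]],
    [[0.7, 0.3, 0.2], [0.1, 0.9, 0.0]],
]
segmentation = query_segmentation(pixel_embeddings, text_embeddings)
```

Because the label set is just a dictionary of text embeddings, new queries can be added at inference time without retraining, which is the flexibility the abstract contrasts with fixed predefined classes.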

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Robotics and Sensor-Based Localization