Fully Exploiting Vision Foundation Model's Profound Prior Knowledge for Generalizable RGB-Depth Driving Scene Parsing
Sicen Guo, Tianyou Wen, Chuang-Wei Liu, Qijun Chen, Rui Fan

TL;DR
This paper introduces the Heterogeneous Feature Integration Transformer (HFIT), which leverages vision foundation models (VFMs) for RGB-depth driving scene parsing without re-training ViTs and achieves state-of-the-art results on the Cityscapes and KITTI datasets.
Contribution
The paper proposes HFIT, an architecture that extracts and integrates heterogeneous RGB and depth features from VFMs without re-training the ViT backbone. Relative depth predictions from VFMs are fed to the HFIT side adapter, which removes the dependence on depth maps.
Findings
HFIT outperforms traditional and existing VFM-based scene parsing methods.
Relative depth predictions from VFMs, used as inputs to the HFIT side adapter, can stand in for depth maps.
The approach achieves state-of-the-art results on Cityscapes and KITTI datasets.
Abstract
Recent vision foundation models (VFMs), typically based on Vision Transformer (ViT), have significantly advanced numerous computer vision tasks. Despite their success in tasks focused solely on RGB images, the potential of VFMs in RGB-depth driving scene parsing remains largely under-explored. In this article, we take one step toward this emerging research area by investigating a feasible technique to fully exploit VFMs for generalizable RGB-depth driving scene parsing. Specifically, we explore the inherent characteristics of RGB and depth data, thereby presenting a Heterogeneous Feature Integration Transformer (HFIT). This network enables the efficient extraction and integration of comprehensive heterogeneous features without re-training ViTs. Relative depth prediction results from VFMs, used as inputs to the HFIT side adapter, overcome the limitations of the dependence on depth maps.…
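The abstract does not say how the relative depth predictions are prepared before they reach the side adapter. As an illustration only, the sketch below assumes a per-image min-max normalization, a common choice because relative depth is typically defined only up to an unknown scale and shift. The function name `normalize_relative_depth` and the toy values are hypothetical and are not taken from the paper.

```python
from typing import List


def normalize_relative_depth(
    depth: List[List[float]], eps: float = 1e-8
) -> List[List[float]]:
    """Min-max normalize one relative depth map to the range [0, 1].

    Assumption: this preprocessing is illustrative, not the paper's method.
    """
    flat = [value for row in depth for value in row]
    lowest, highest = min(flat), max(flat)
    span = highest - lowest
    if span < eps:
        # A constant map carries no depth ordering, so return zeros.
        return [[0.0 for _ in row] for row in depth]
    return [[(value - lowest) / span for value in row] for row in depth]


# Two hypothetical predictions of the same scene that differ only by
# an arbitrary positive scale and a shift.
pred_a = [[1.0, 2.0], [3.0, 5.0]]
pred_b = [[10.0 * value + 7.0 for value in row] for row in pred_a]

norm_a = normalize_relative_depth(pred_a)
norm_b = normalize_relative_depth(pred_b)
```

After normalization the two maps coincide, which is the property that makes a relative prediction usable as a depth-like input without metric calibration.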
Taxonomy
Topics: Medical Image Segmentation Techniques · Image and Object Detection Techniques · Industrial Vision Systems and Defect Detection
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax · Dropout · Absolute Position Encodings · Label Smoothing · Vision Transformer
