MV-DETR: Multi-modality indoor object detection by Multi-View DEtection TRansformers
Zichao Dong, Yilin Zhang, Xufeng Huang, Hang Ji, Zhan Shi, Xin Zhan, Junbo Chen

TL;DR
MV-DETR is a novel multi-modality transformer-based detection pipeline that effectively combines geometry and texture cues from RGBD data, achieving state-of-the-art indoor object detection results on ScanNetV2.
Contribution
The paper introduces a lightweight visual-geometric module and demonstrates the importance of separate encoding of geometry and texture cues in RGBD object detection.
Findings
Achieves 78% AP on ScanNetV2, setting a new state-of-the-art.
Effectively leverages pretrained visual encoders for texture features.
Demonstrates the importance of separate geometry and texture encoding.
Abstract
We introduce MV-DETR, a novel pipeline that is an effective yet efficient transformer-based detection method. Given input RGBD data, we notice that there are very strong pretrained weights for RGB data but less effective ones for depth-related data. First and foremost, we argue that geometry and texture cues are both of vital importance and can be encoded separately. Secondly, we find that visual texture features are relatively hard to extract compared with geometry features in 3D space. Unfortunately, a single RGBD dataset with thousands of samples is not enough to train a discriminative filter for visual texture feature extraction. Last but certainly not least, we design a lightweight VG module consisting of a visual textual encoder, a geometry encoder and a VG connector. Compared with previous state-of-the-art works like V-DETR, gains from pretrained visual encoder could be…
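The abstract names three parts of the VG module (a visual encoder, a geometry encoder and a VG connector) but the excerpt here does not say how they are wired together. The following is only a minimal sketch of the idea of encoding geometry and texture separately and then fusing them. Everything in it is an assumption made for illustration: the function names, the feature sizes, the random weights standing in for learned or pretrained ones, and the choice of concatenate-then-project as the connector. It should not be read as the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Random weights and zero bias; a stand-in for learned parameters."""
    return rng.standard_normal((in_dim, out_dim)) * 0.1, np.zeros(out_dim)

def init_params(d_geo=16, d_tex=32, d_model=24):
    # Hypothetical sizes; none of these come from the paper.
    return {
        "geo": linear(3, d_geo),                  # xyz -> geometry features
        "tex": linear(3, d_tex),                  # rgb -> texture features
        "connector": linear(d_geo + d_tex, d_model),
    }

def geometry_encoder(xyz, params):
    # Assumed form: per-point linear layer followed by ReLU on 3D coordinates.
    w, b = params
    return np.maximum(xyz @ w + b, 0.0)

def texture_encoder(rgb, params):
    # Assumed stand-in for a frozen, pretrained visual encoder.
    w, b = params
    return np.maximum(rgb @ w + b, 0.0)

def vg_connector(geo_feat, tex_feat, params):
    # Assumed connector: concatenate both streams, then project to one width.
    w, b = params
    return np.concatenate([geo_feat, tex_feat], axis=-1) @ w + b

def vg_module(xyz, rgb, params):
    geo = geometry_encoder(xyz, params["geo"])    # geometry stream
    tex = texture_encoder(rgb, params["tex"])     # texture stream, encoded separately
    return vg_connector(geo, tex, params["connector"])

params = init_params()
points_xyz = rng.standard_normal((100, 3))        # 100 points with 3D coordinates
points_rgb = rng.random((100, 3))                 # matching per-point colours in [0, 1)
tokens = vg_module(points_xyz, points_rgb, params)
print(tokens.shape)  # (100, 24)
```

In a real detector the fused tokens would feed a transformer decoder, and the texture branch would be a pretrained image backbone rather than a random projection. Keeping the two encoders as separate functions mirrors the claim that the two cues can be encoded independently.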
Taxonomy
Topics: Video Surveillance and Tracking Methods · Advanced Image and Video Retrieval Techniques · IoT-based Smart Home Systems
