CVSformer: Cross-View Synthesis Transformer for Semantic Scene Completion
Haotian Dong (1), Enhui Ma (1), Lubo Wang (1), Miaohui Wang (2), Wuyuan Xie (2), Qing Guo (3), Ping Li (4), Lingyu Liang (5), Kairui Yang (6), Di Lin (1) ((1) Tianjin University, (2) Shenzhen University, (3) A*STAR, (4) The Hong Kong Polytechnic University

TL;DR
CVSformer introduces a novel transformer-based approach for semantic scene completion that effectively models cross-view object relationships, leading to state-of-the-art results in 3D scene understanding.
Contribution
The paper proposes CVSformer, a new model combining multi-view feature synthesis and cross-view transformer to improve 3D scene completion accuracy.
Findings
Achieves state-of-the-art performance on public datasets.
Effectively models cross-view relationships for occluded object reasoning.
Outperforms existing voxel-based SSC methods.
Abstract
Semantic scene completion (SSC) requires an accurate understanding of the geometric and semantic relationships between objects in a 3D scene in order to reason about occluded objects. Popular SSC methods voxelize the 3D objects, allowing a deep 3D convolutional network (3D CNN) to learn object relationships from complex scenes. However, current networks lack controllable kernels for modeling object relationships across multiple views, where appropriate views provide relevant information suggesting the existence of occluded objects. In this paper, we propose the Cross-View Synthesis Transformer (CVSformer), which consists of Multi-View Feature Synthesis and a Cross-View Transformer for learning cross-view object relationships. In the multi-view feature synthesis, we use a set of 3D convolutional kernels rotated differently to compute the multi-view features for…
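The idea of computing multi-view features with differently rotated 3D kernels can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes rotations by multiples of 90 degrees about a single axis (the paper's kernels may use other angles), a single-channel voxel grid, an odd-sized kernel, and a naive zero-padded correlation. The names `rotated_kernels`, `conv3d_same`, and `multi_view_features` are hypothetical and chosen only for illustration.

```python
import numpy as np

def rotated_kernels(kernel, num_views=4):
    """Copies of a 3D kernel rotated by k * 90 degrees in the plane of axes 1 and 2.

    Assumption: 90-degree steps keep the kernel on the voxel grid without
    interpolation; this is a simplification chosen for this sketch.
    """
    return [np.rot90(kernel, k=k, axes=(1, 2)) for k in range(num_views)]

def conv3d_same(volume, kernel):
    """Naive zero-padded 3D cross-correlation; output has the input's shape.

    Assumes every kernel dimension is odd.
    """
    kd, kh, kw = kernel.shape
    d, h, w = volume.shape
    padded = np.pad(volume, [(kd // 2, kd // 2), (kh // 2, kh // 2), (kw // 2, kw // 2)])
    out = np.zeros(volume.shape, dtype=float)
    # Accumulate one shifted copy of the volume per kernel tap.
    for dz in range(kd):
        for dy in range(kh):
            for dx in range(kw):
                out += kernel[dz, dy, dx] * padded[dz:dz + d, dy:dy + h, dx:dx + w]
    return out

def multi_view_features(volume, kernel, num_views=4):
    """Stack one feature volume per rotated kernel: shape (num_views, D, H, W)."""
    return np.stack([conv3d_same(volume, k) for k in rotated_kernels(kernel, num_views)])

# Usage: an 8x8x8 voxel grid and one 3x3x3 kernel yield four "view" features.
rng = np.random.default_rng(0)
volume = rng.random((8, 8, 8))
kernel = rng.random((3, 3, 3))
features = multi_view_features(volume, kernel)
print(features.shape)
```

Applying a rotated kernel to a fixed voxel grid is equivalent to looking at a rotated scene with the original kernel and rotating the result back, which is why each output can be read as the feature of a different view. How those per-view features are then related to one another is the role the abstract assigns to the Cross-View Transformer, which this sketch does not cover.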
Taxonomy
Topics: Advanced Vision and Imaging · Robotics and Sensor-Based Localization · 3D Shape Modeling and Analysis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Absolute Position Encodings · Adam · Layer Normalization
