CVSformer: Cross-View Synthesis Transformer for Semantic Scene Completion

Haotian Dong (1), Enhui Ma (1), Lubo Wang (1), Miaohui Wang (2), Wuyuan Xie (2), Qing Guo (3), Ping Li (4), Lingyu Liang (5), Kairui Yang (6), Di Lin (1)

(1) Tianjin University; (2) Shenzhen University; (3) A*STAR; (4) The Hong Kong Polytechnic University; (5) South China University of Technology; (6) Alibaba Damo Academy

arXiv:2307.07938·cs.CV·July 18, 2023

Open Access

TL;DR

CVSformer introduces a novel transformer-based approach for semantic scene completion that effectively models cross-view object relationships, leading to state-of-the-art results in 3D scene understanding.

Contribution

The paper proposes CVSformer, a model that combines Multi-View Feature Synthesis with a Cross-View Transformer to learn cross-view object relationships and improve the accuracy of 3D semantic scene completion.

Findings

01 Achieves state-of-the-art performance on public datasets.

02 Effectively models cross-view relationships for occluded object reasoning.

03 Outperforms existing voxel-based SSC methods.

Abstract

Semantic scene completion (SSC) requires an accurate understanding of the geometric and semantic relationships between the objects in a 3D scene in order to reason about occluded objects. Popular SSC methods voxelize the 3D objects, allowing a deep 3D convolutional network (3D CNN) to learn the object relationships from complex scenes. However, current networks lack controllable kernels to model the object relationships across multiple views, where appropriate views provide relevant information that suggests the existence of occluded objects. In this paper, we propose the Cross-View Synthesis Transformer (CVSformer), which consists of Multi-View Feature Synthesis and a Cross-View Transformer for learning cross-view object relationships. In the multi-view feature synthesis, we use a set of 3D convolutional kernels rotated differently to compute the multi-view features for…
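The idea of computing multi-view features with differently rotated 3D kernels can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`rotated_kernels`, `correlate3d_valid`, `multi_view_features`), the restriction to 90-degree rotations about one axis, and the single-channel "valid" correlation are all assumptions made here for illustration. The abstract only states that the kernels are "rotated differently".

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rotated_kernels(kernel, num_views=4):
    """Return copies of a cubic 3D kernel rotated by k * 90 degrees.

    Assumption: rotation happens in the plane of axes (0, 2), i.e. about
    axis 1, which is treated here as the vertical axis of the voxel grid.
    """
    return [np.rot90(kernel, k=k, axes=(0, 2)) for k in range(num_views)]


def correlate3d_valid(volume, kernel):
    """Single-channel 3D cross-correlation without padding ('valid' mode)."""
    # windows has shape (X', Y', Z', kx, ky, kz): one kernel-sized patch
    # for every position where the kernel fits entirely inside the volume.
    windows = sliding_window_view(volume, kernel.shape)
    return np.einsum("xyzijk,ijk->xyz", windows, kernel)


def multi_view_features(volume, kernel, num_views=4):
    """Stack the responses of every rotated kernel along a new view axis."""
    views = [correlate3d_valid(volume, k) for k in rotated_kernels(kernel, num_views)]
    return np.stack(views)


# Toy example: a random 8x8x8 voxel grid and a random 3x3x3 kernel.
rng = np.random.default_rng(0)
volume = rng.random((8, 8, 8))
kernel = rng.random((3, 3, 3))
features = multi_view_features(volume, kernel)
print(features.shape)  # (4, 6, 6, 6): 4 views, each of size 8 - 3 + 1 = 6 per axis
```

In this sketch every view shares the same underlying weights and differs only in kernel orientation, so each output volume responds to the same pattern seen from a different direction. How CVSformer parameterizes the rotations and how the Cross-View Transformer fuses the resulting features is not recoverable from the truncated abstract.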

Taxonomy

Topics: Advanced Vision and Imaging · Robotics and Sensor-Based Localization · 3D Shape Modeling and Analysis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Absolute Position Encodings · Adam · Layer Normalization