Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds

Heng Wang; Chaoyi Zhang; Jianhui Yu; Weidong Cai

arXiv:2204.10688 · cs.CV · April 25, 2022

TL;DR

This paper introduces SpaCap3D, a transformer-based model that leverages spatial relations to improve 3D dense captioning of objects in point cloud scenes, outperforming existing methods.

Contribution

The paper proposes a novel spatiality-guided encoder-decoder architecture for 3D dense captioning, emphasizing spatial relation learning to enhance caption accuracy.

Findings

01. Outperforms the Scan2Cap baseline by 4.94% and 9.61% in CIDEr@0.5IoU on ScanRefer and ReferIt3D, respectively

02. Introduces spatial relation learning in 3D captioning models

03. Demonstrates the effectiveness of spatiality-guided encoding in scene understanding

Abstract

Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding. Apart from coarse semantic class prediction and bounding box regression as in traditional 3D object detection, 3D dense captioning aims at producing a further and finer instance-level label of natural language description on visual appearance and spatial relations for each scene object of interest. To detect and describe objects in a scene, following the spirit of neural machine translation, we propose a transformer-based encoder-decoder architecture, namely SpaCap3D, to transform objects into descriptions, where we especially investigate the relative spatiality of objects in 3D scenes and design a spatiality-guided encoder via a token-to-token spatial relation learning objective and an object-centric decoder for precise and spatiality-enhanced object caption…

Code & Models

Repositories

heng-hw/spacap3d (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Neural Network Applications

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Dropout · Byte Pair Encoding · Adam · Dense Connections · Softmax