Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds
Heng Wang, Chaoyi Zhang, Jianhui Yu, Weidong Cai

TL;DR
This paper introduces SpaCap3D, a transformer-based model that leverages spatial relations to improve 3D dense captioning of objects in point cloud scenes, outperforming existing methods.
Contribution
The paper proposes a novel spatiality-guided encoder-decoder architecture for 3D dense captioning, emphasizing spatial relation learning to enhance caption accuracy.
Findings
Outperforms the baseline by 4.94% and 9.61% in CIDEr@0.5IoU on two datasets
Introduces spatial relation learning in 3D captioning models
Demonstrates the effectiveness of spatiality-guided encoding in scene understanding
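The headline numbers above use an IoU-thresholded captioning metric. As a rough illustration only, the sketch below assumes the common convention that an object's caption score counts only when its predicted box overlaps the ground-truth box with IoU of at least 0.5, and that the gated scores are averaged over all objects. The helper names (`iou_3d`, `metric_at_iou`), the axis-aligned box layout, and the use of per-object scores are assumptions made for this example, not the paper's evaluation code.

```python
from typing import Sequence, Tuple

# Hypothetical box layout: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box; degenerate boxes have volume 0."""
    volume = 1.0
    for axis in range(3):
        volume *= max(0.0, box[axis + 3] - box[axis])
    return volume


def iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis, so no overlap at all
        inter *= hi - lo
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


def metric_at_iou(scores: Sequence[float], ious: Sequence[float],
                  threshold: float = 0.5) -> float:
    """Average caption score, zeroing objects whose box IoU is below threshold.

    Assumption: the threshold is inclusive (IoU >= threshold counts).
    """
    if len(scores) != len(ious):
        raise ValueError("scores and ious must have the same length")
    if not scores:
        return 0.0
    kept = sum(s for s, iou in zip(scores, ious) if iou >= threshold)
    return kept / len(scores)
```

Under this convention a well-worded caption attached to a poorly localized box contributes nothing, so the metric rewards detection and description jointly.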
Abstract
Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding. Apart from coarse semantic class prediction and bounding box regression as in traditional 3D object detection, 3D dense captioning aims at producing a further and finer instance-level label of natural language description on visual appearance and spatial relations for each scene object of interest. To detect and describe objects in a scene, following the spirit of neural machine translation, we propose a transformer-based encoder-decoder architecture, namely SpaCap3D, to transform objects into descriptions, where we especially investigate the relative spatiality of objects in 3D scenes and design a spatiality-guided encoder via a token-to-token spatial relation learning objective and an object-centric decoder for precise and spatiality-enhanced object caption…
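The abstract as shown here is truncated and does not say how the token-to-token spatial relations are labeled. One way to picture such supervision, purely as an assumption for illustration, is to give every ordered pair of detected objects a relation label per axis derived from the sign of the offset between their box centers. The function names and the three-way labeling (-1, 0, +1) below are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]      # object center (x, y, z)
Relation = Tuple[int, int, int]         # per-axis label in {-1, 0, +1}


def axis_relation(delta: float, tol: float) -> int:
    """Sign of an offset along one axis, with a tolerance band mapped to 0."""
    if delta > tol:
        return 1
    if delta < -tol:
        return -1
    return 0


def pairwise_relations(centers: Sequence[Point],
                       tol: float = 1e-6) -> List[List[Relation]]:
    """Label every ordered pair (i, j) with where object j sits relative to i.

    relations[i][j][k] is +1 if j has a larger coordinate than i on axis k,
    -1 if smaller, and 0 if the two centers coincide within `tol`.
    """
    n = len(centers)
    return [
        [
            tuple(
                axis_relation(centers[j][k] - centers[i][k], tol)
                for k in range(3)
            )
            for j in range(n)
        ]
        for i in range(n)
    ]


# Example: two objects at the same height, offset along x and y.
relations = pairwise_relations([(0.0, 0.0, 0.0), (1.0, -1.0, 0.0)])
```

A matrix like `relations` could serve as the target for a classification head over pairs of encoder tokens, which is the general shape of a token-to-token objective; the actual label definition and loss used by SpaCap3D should be taken from the paper itself.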
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Neural Network Applications
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Dropout · Byte Pair Encoding · Adam · Dense Connections · Softmax
