X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning
Zhihao Yuan, Xu Yan, Yinghong Liao, Yao Guo, Guanbin Li, Zhen Li, Shuguang Cui

TL;DR
X-Trans2Cap leverages cross-modal knowledge transfer via Transformer to enhance 3D dense captioning, enabling faithful descriptions from point clouds alone by distilling 2D image features during training.
Contribution
The paper introduces a novel cross-modal knowledge transfer framework using Transformer for 3D dense captioning, improving single-modal performance without extra inference costs.
Findings
Outperforms the previous state of the art by about 21 absolute CIDEr points on ScanRefer
Achieves an improvement of about 16 absolute CIDEr points on Nr3D
Effectively transfers 2D appearance features to 3D captioning
Abstract
3D dense captioning aims to describe individual objects by natural language in 3D scenes, where 3D scenes are usually represented as RGB-D scans or point clouds. However, only exploiting single modal information, e.g., point cloud, previous approaches fail to produce faithful descriptions. Though aggregating 2D features into point clouds may be beneficial, it introduces an extra computational burden, especially in inference phases. In this study, we investigate a cross-modal knowledge transfer using Transformer for 3D dense captioning, X-Trans2Cap, to effectively boost the performance of single-modal 3D caption through knowledge distillation using a teacher-student framework. In practice, during the training phase, the teacher network exploits auxiliary 2D modality and guides the student network that only takes point clouds as input through the feature consistency constraints. Owing to…
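The feature consistency idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the excerpt does not state the exact form of the constraint, so the mean squared error used here, the function names `feature_consistency_loss` and `student_objective`, and the `weight` parameter are all assumptions chosen for illustration. The teacher features stand in for representations computed with the auxiliary 2D modality, and the student features for representations computed from point clouds alone.

```python
from typing import Sequence


def feature_consistency_loss(
    student: Sequence[Sequence[float]],
    teacher: Sequence[Sequence[float]],
) -> float:
    """Mean squared error between per-object student and teacher features.

    Assumption: MSE is used as the consistency measure; the source excerpt
    does not specify the actual constraint.
    """
    if len(student) != len(teacher):
        raise ValueError("student and teacher must cover the same objects")
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student, teacher):
        if len(s_vec) != len(t_vec):
            raise ValueError("feature dimensions must match")
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count if count else 0.0


def student_objective(
    caption_loss: float,
    student: Sequence[Sequence[float]],
    teacher: Sequence[Sequence[float]],
    weight: float = 1.0,  # hypothetical balancing factor
) -> float:
    """Captioning loss plus a weighted consistency term (training only)."""
    return caption_loss + weight * feature_consistency_loss(student, teacher)


# Toy example: one object with a 2-dimensional feature vector.
student_feats = [[1.0, 2.0]]
teacher_feats = [[1.0, 4.0]]
loss = student_objective(0.5, student_feats, teacher_feats, weight=0.1)
```

Because the consistency term only appears in the training objective, the student can be run at inference time without the teacher or any 2D input, which matches the abstract's point about avoiding extra inference cost.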
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Absolute Position Encodings · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Residual Connection · Layer Normalization
