X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning

Zhihao Yuan; Xu Yan; Yinghong Liao; Yao Guo; Guanbin Li; Zhen Li; Shuguang Cui

arXiv:2203.00843·cs.CV·April 7, 2022



Open Access · 1 Repo

TL;DR

X-Trans2Cap leverages cross-modal knowledge transfer via Transformer to enhance 3D dense captioning, enabling faithful descriptions from point clouds alone by distilling 2D image features during training.

Contribution

The paper introduces a novel cross-modal knowledge transfer framework using Transformer for 3D dense captioning, improving single-modal performance without extra inference costs.

Findings

01. Outperforms the previous state of the art by +21 CIDEr on ScanRefer

02. Achieves a +16 CIDEr improvement on Nr3D

03. Effectively transfers 2D appearance features to 3D captioning

Abstract

3D dense captioning aims to describe individual objects in 3D scenes in natural language, where the scenes are usually represented as RGB-D scans or point clouds. However, by exploiting only single-modal information, e.g., point clouds, previous approaches fail to produce faithful descriptions. Though aggregating 2D features into point clouds may be beneficial, it introduces an extra computational burden, especially in the inference phase. In this study, we investigate cross-modal knowledge transfer using Transformer for 3D dense captioning, X-Trans2Cap, to effectively boost the performance of single-modal 3D captioning through knowledge distillation in a teacher-student framework. In practice, during the training phase, the teacher network exploits the auxiliary 2D modality and guides the student network, which takes only point clouds as input, through feature consistency constraints. Owing to…
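The teacher-student training described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the form of the feature consistency constraint, so a mean squared error is assumed here, and the names `feature_consistency_loss`, `student_training_loss`, and `weight` are hypothetical. Features are plain Python lists, one vector per object, and the teacher features are treated as fixed targets.

```python
from typing import Sequence

Vector = Sequence[float]


def feature_consistency_loss(student_feats: Sequence[Vector],
                             teacher_feats: Sequence[Vector]) -> float:
    """Mean squared error between student and teacher object features.

    Assumption: MSE stands in for the paper's consistency constraint,
    whose exact form is not given in the abstract.
    """
    if len(student_feats) != len(teacher_feats):
        raise ValueError("student and teacher must cover the same objects")
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student_feats, teacher_feats):
        if len(s_vec) != len(t_vec):
            raise ValueError("feature dimensions must match")
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count if count else 0.0


def student_training_loss(caption_loss: float,
                          student_feats: Sequence[Vector],
                          teacher_feats: Sequence[Vector],
                          weight: float = 1.0) -> float:
    """Captioning loss plus a weighted consistency term (hypothetical weighting)."""
    return caption_loss + weight * feature_consistency_loss(student_feats, teacher_feats)


# Toy features: the teacher would see 2D + 3D inputs, the student point clouds only.
teacher = [[1.0, 2.0], [0.0, 1.0]]
student = [[1.0, 0.0], [0.0, 1.0]]
consistency = feature_consistency_loss(student, teacher)
total = student_training_loss(2.5, student, teacher, weight=0.5)
```

In this sketch the consistency term is used only during training. At inference the student runs alone on point clouds, which is why the abstract's approach adds no 2D processing cost at test time.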


Code & Models

Repositories

curryyuan/x-trans2cap
PyTorch · Official


Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Neural Network Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Absolute Position Encodings · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Residual Connection · Layer Normalization