Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning?

Runpei Dong; Zekun Qi; Linfeng Zhang; Junbo Zhang; Jianjian Sun; Zheng Ge; Li Yi; Kaisheng Ma

arXiv:2212.08320·cs.CV·February 3, 2023·21 cites

TL;DR

This paper introduces a novel method where pretrained 2D image and language Transformers serve as teachers to improve 3D representation learning via autoencoders, achieving state-of-the-art results in 3D classification tasks.

Contribution

It proposes a unified masked modeling framework using cross-modal knowledge distillation with frozen pretrained Transformers as teachers for 3D autoencoder training.
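
The idea of masked modeling as distillation from a frozen teacher can be illustrated with a toy sketch. Everything below is hypothetical and chosen for illustration only: the function names, the fixed linear "teacher", the mean-of-visible-tokens "student", and the mean squared error objective are assumptions, not the paper's architecture or loss.

```python
def frozen_teacher(tokens):
    """Hypothetical frozen teacher: a fixed map from each token to a latent feature."""
    return [[2.0 * x + 1.0 for x in tok] for tok in tokens]


def student_predict(tokens, mask, weight, bias):
    """Toy student: predicts every position from the mean of the visible tokens."""
    visible = [tok for tok, m in zip(tokens, mask) if not m]
    dim = len(tokens[0])
    context = [sum(tok[d] for tok in visible) / len(visible) for d in range(dim)]
    return [[weight * c + bias for c in context] for _ in tokens]


def masked_distillation_loss(tokens, mask, weight, bias):
    """Mean squared error between student and teacher features at masked positions only."""
    targets = frozen_teacher(tokens)  # teacher encodes the full input and is never updated
    preds = student_predict(tokens, mask, weight, bias)  # student uses visible tokens only
    errors = [
        (p - t) ** 2
        for pred, target, m in zip(preds, targets, mask) if m
        for p, t in zip(pred, target)
    ]
    return sum(errors) / len(errors)


# Four identical 2-D tokens; True marks a masked position.
tokens = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
mask = [True, False, True, False]

loss_matched = masked_distillation_loss(tokens, mask, weight=2.0, bias=1.0)
loss_untrained = masked_distillation_loss(tokens, mask, weight=0.0, bias=0.0)
```

In this toy setup the loss falls to zero once the student's parameters reproduce the teacher's features at the masked positions; only the student's `weight` and `bias` would be trained, while the teacher stays fixed.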

Findings

01 Achieves 88.21% accuracy on the ScanObjectNN benchmark.

02 Demonstrates effective transfer of knowledge from 2D and language models to 3D tasks.

03 Sets a new state of the art in 3D representation learning.

Abstract

The success of deep learning heavily relies on large-scale data with comprehensive labels, which is more expensive and time-consuming to obtain in 3D than for 2D images or natural language. This motivates using models pretrained on modalities other than 3D as teachers for cross-modal knowledge transfer. In this paper, we revisit masked modeling in a unified fashion of knowledge distillation, and we show that foundational Transformers pretrained with 2D images or natural language can help self-supervised 3D representation learning through training Autoencoders as Cross-Modal Teachers (ACT). The pretrained Transformers are transferred as cross-modal 3D teachers using discrete variational autoencoding self-supervision, during which the Transformers are frozen with prompt tuning for better knowledge inheritance. The latent features encoded by the 3D teachers are used…

Code & Models

Videos

Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning? · SlidesLive

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · 3D Surveying and Cultural Heritage

Methods: Multi-Head Attention · Attention Is All You Need · Dropout · Linear Layer · Byte Pair Encoding · Absolute Position Encodings · Dense Connections · Residual Connection · Label Smoothing · Adam