Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning?
Runpei Dong, Zekun Qi, Linfeng Zhang, Junbo Zhang, Jianjian Sun, Zheng Ge, Li Yi, Kaisheng Ma

TL;DR
This paper introduces ACT, a method in which Transformers pretrained on 2D images or natural language serve as cross-modal teachers for self-supervised 3D representation learning via autoencoders, achieving state-of-the-art results on 3D classification tasks.
Contribution
It proposes a unified masked modeling framework using cross-modal knowledge distillation with frozen pretrained Transformers as teachers for 3D autoencoder training.
Findings
Achieves 88.21% accuracy on the ScanObjectNN benchmark.
Demonstrates effective transfer of knowledge from 2D image and language models to 3D tasks.
Sets a new state of the art in 3D representation learning.
Abstract
The success of deep learning heavily relies on large-scale data with comprehensive labels, which is more expensive and time-consuming to fetch in 3D compared to 2D images or natural languages. This promotes the potential of utilizing models pretrained with data more than 3D as teachers for cross-modal knowledge transferring. In this paper, we revisit masked modeling in a unified fashion of knowledge distillation, and we show that foundational Transformers pretrained with 2D images or natural languages can help self-supervised 3D representation learning through training Autoencoders as Cross-Modal Teachers (ACT). The pretrained Transformers are transferred as cross-modal 3D teachers using discrete variational autoencoding self-supervision, during which the Transformers are frozen with prompt tuning for better knowledge inheritance. The latent features encoded by the 3D teachers are used…
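The masked-modeling-as-distillation idea described above can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the authors' implementation: it assumes the frozen teacher's latent features act as regression targets at masked positions, and every name here (`linear`, `random_mask`, `distillation_loss`, the 0.75 mask ratio, the cosine-distance loss) is hypothetical. Single weight matrices stand in for the Transformer teacher and student, and the mean of the visible tokens stands in for the student's encoded context.

```python
import math
import random


def linear(vec, weights):
    """Apply a weight matrix (a list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def cosine_distance(a, b):
    """Return 1 - cosine similarity (0 when the vectors are aligned)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm + 1e-12)


def random_mask(num_tokens, mask_ratio, rng):
    """Choose which token indices are hidden from the student."""
    num_masked = int(num_tokens * mask_ratio)
    return sorted(rng.sample(range(num_tokens), num_masked))


def distillation_loss(tokens, masked_idx, teacher_w, student_w):
    """Mean cosine distance between student predictions and teacher targets.

    The teacher (frozen, never updated here) sees the full input; the
    student sees only the visible tokens. Assumes at least one visible
    and one masked token.
    """
    masked = set(masked_idx)
    visible = [t for i, t in enumerate(tokens) if i not in masked]
    dim = len(tokens[0])
    # Toy stand-in for the student encoder: average the visible tokens.
    context = [sum(t[d] for t in visible) / len(visible) for d in range(dim)]
    prediction = linear(context, student_w)
    losses = []
    for i in masked_idx:
        target = linear(tokens[i], teacher_w)  # frozen-teacher latent feature
        losses.append(cosine_distance(prediction, target))
    return sum(losses) / len(losses)


rng = random.Random(0)
dim, num_tokens = 4, 8
tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
teacher_w = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
student_w = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
masked_idx = random_mask(num_tokens, 0.75, rng)
loss = distillation_loss(tokens, masked_idx, teacher_w, student_w)
print(f"masked {len(masked_idx)} of {num_tokens} tokens, loss = {loss:.4f}")
```

In a real training loop only the student's parameters would receive gradient updates from this loss; the teacher weights stay fixed, which is what "frozen" refers to in the abstract.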
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · 3D Surveying and Cultural Heritage
Methods: Multi-Head Attention · Attention Is All You Need · Dropout · Linear Layer · Byte Pair Encoding · Absolute Position Encodings · Dense Connections · Residual Connection · Label Smoothing · Adam
