Multi-modal Multi-task Pre-training for Improved Point Cloud Understanding
Liwen Liu, Weidong Yang, Lipeng Ma, Ben Fei

TL;DR
This paper introduces MMPT, a multi-modal multi-task pre-training framework that applies three self-supervised tasks across 3D point clouds and 2D images to improve point cloud understanding without requiring 3D annotations.
Contribution
The paper presents a novel multi-task pre-training approach combining token-level, point-level, and contrastive learning for enhanced 3D point cloud understanding.
Findings
Outperforms state-of-the-art methods on multiple benchmarks.
Effective in both discriminative and generative tasks.
Operates without requiring 3D annotations.
Abstract
Recent advances in multi-modal pre-training methods have shown promising effectiveness in learning 3D representations by aligning multi-modal features between 3D shapes and their corresponding 2D counterparts. However, existing multi-modal pre-training frameworks primarily rely on a single pre-training task to gather multi-modal data in 3D applications. This limitation prevents the models from obtaining the abundant information provided by other relevant tasks, which can hinder their performance in downstream tasks, particularly in complex and diverse domains. In order to tackle this issue, we propose MMPT, a Multi-modal Multi-task Pre-training framework designed to enhance point cloud understanding. Specifically, three pre-training tasks are devised: (i) Token-level reconstruction (TLR) aims to recover masked point tokens, endowing the model with representative learning abilities. (ii)…
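As an illustration of the cross-modal alignment idea the abstract describes (pulling the features of a 3D shape toward those of its corresponding 2D counterpart), here is a minimal sketch of a symmetric InfoNCE-style contrastive loss. The excerpt above does not specify the loss MMPT actually uses, so the choice of InfoNCE, the temperature value, and the names `l2_normalize` and `info_nce` are all assumptions made for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce(point_embs, image_embs, temperature=0.07):
    """Symmetric InfoNCE loss over paired 3D and 2D embeddings.

    Hypothetical helper: point_embs[i] and image_embs[i] are assumed to come
    from the same object; every other pairing in the batch is a negative.
    """
    if not point_embs or len(point_embs) != len(image_embs):
        raise ValueError("need two non-empty batches of equal size")

    p = [l2_normalize(v) for v in point_embs]
    q = [l2_normalize(v) for v in image_embs]
    n = len(p)

    # Cosine similarity between every 3D embedding and every 2D embedding,
    # scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(p[i], q[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def mean_cross_entropy(mat):
        # Cross-entropy with the diagonal entry as the correct class,
        # using the log-sum-exp trick for numerical stability.
        total = 0.0
        for i, row in enumerate(mat):
            m = max(row)
            log_denom = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_denom - row[i]
        return total / len(mat)

    transposed = [list(col) for col in zip(*logits)]
    # Average the 3D-to-2D and 2D-to-3D directions.
    return 0.5 * (mean_cross_entropy(logits) + mean_cross_entropy(transposed))


if __name__ == "__main__":
    points = [[1.0, 0.0], [0.0, 1.0]]
    images_aligned = [[1.0, 0.0], [0.0, 1.0]]
    images_swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(info_nce(points, images_aligned) < info_nce(points, images_swapped))
```

The loss is small when each point cloud embedding is closest to its own image embedding and large when the pairs are mismatched. No labels are involved, only the pairing between modalities, which is consistent with the claim that pre-training needs no 3D annotations.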
Taxonomy
Topics: 3D Surveying and Cultural Heritage · Remote Sensing and LiDAR Applications · Image Processing and 3D Reconstruction
