ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights

Weixian Lei; Yixiao Ge; Jianfeng Zhang; Dylan Sun; Kun Yi; Ying Shan; Mike Zheng Shou

arXiv:2308.10185·cs.CV·March 27, 2024

TL;DR

ViT-Lens is a unified framework that pairs a pre-trained vision transformer with modality-specific lenses to learn and align representations across modalities efficiently. It is demonstrated here on 3D data, where it enables zero-shot tasks.

Contribution

The paper proposes ViT-Lens, a novel approach that aligns multimodal signals into a shared space using a pre-trained ViT and modality-specific lenses, facilitating scalable omni-modal learning.

Findings

01

Achieves 52.0% accuracy in zero-shot 3D classification on Objaverse-LVIS

02

Enables zero-shot 3D question-answering without additional training

03

Demonstrates effective modality alignment and emergent capabilities
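As a rough illustration of what zero-shot classification in an aligned embedding space involves, the sketch below picks the label whose text embedding is most similar to a query embedding. It is a minimal toy, not the paper's code: the function names (`cosine`, `zero_shot_classify`), the two-dimensional vectors, and the use of cosine similarity are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (raises on a zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def zero_shot_classify(query_emb, label_embs):
    """Return the label whose embedding is closest to the query.

    query_emb  : embedding of the input (e.g. a 3D shape), hypothetical here
    label_embs : dict mapping label name -> text embedding, hypothetical here
    """
    return max(label_embs, key=lambda name: cosine(query_emb, label_embs[name]))


# Toy 2-D embeddings standing in for real encoder outputs (assumed values).
labels = {"chair": [1.0, 0.0], "lamp": [0.0, 1.0]}
print(zero_shot_classify([0.9, 0.2], labels))
```

No classifier is trained on the label set in this setup; new categories are added simply by embedding their names.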

Abstract

Despite the success of CLIP-based training recipes in vision-language models, their scalability to more modalities (e.g., 3D, audio) is limited by the need for large-scale data, which is expensive or even unavailable for rare modalities. In this paper, we present ViT-Lens, which facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space. Specifically, the modality-specific lens is tuned to project multimodal signals to the shared embedding space, where they are then processed by a strong ViT that carries pre-trained image knowledge. The encoded multimodal representations are optimized toward aligning with the modal-independent space, pre-defined by off-the-shelf foundation models. A well-trained lens with a ViT backbone has the potential to serve as one of these foundation models, supervising the learning of…
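The abstract does not state the training objective, so the following is only one plausible reading of "optimized toward aligning": a symmetric CLIP-style contrastive loss between lens-encoded embeddings and fixed anchor embeddings from a foundation model. The function name `alignment_loss`, the temperature of 0.07, and the choice of a contrastive loss are assumptions, not details confirmed by the source.

```python
import math


def _normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def alignment_loss(modality_embs, anchor_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    modality_embs : outputs of the (hypothetical) lens + ViT encoder
    anchor_embs   : fixed targets from an off-the-shelf model (assumed frozen)
    temperature   : softmax temperature; 0.07 is an assumed default
    """
    a = [_normalize(v) for v in modality_embs]
    b = [_normalize(v) for v in anchor_embs]
    # Pairwise cosine similarities scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


anchors = [[1.0, 0.0], [0.0, 1.0]]
print(alignment_loss(anchors, anchors) < alignment_loss([[0.0, 1.0], [1.0, 0.0]], anchors))
```

In this reading only the lens parameters would receive gradients, while the ViT and the anchor model stay fixed; the toy above computes the loss value only and does no optimization.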

Code & Models

Repositories

TencentARC/ViT-Lens
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning