ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights
Weixian Lei, Yixiao Ge, Jianfeng Zhang, Dylan Sun, Kun Yi, Ying Shan, Mike Zheng Shou

TL;DR
ViT-Lens introduces a unified framework that leverages pre-trained vision transformers and modality-specific lenses to efficiently learn and align representations across multiple modalities, demonstrated here with 3D data, enabling zero-shot tasks.
Contribution
The paper proposes ViT-Lens, a novel approach that aligns multimodal signals into a shared space using a pre-trained ViT and modality-specific lenses, facilitating scalable omni-modal learning.
Findings
Achieves 52.0% accuracy in zero-shot 3D classification on Objaverse-LVIS
Enables zero-shot 3D question-answering without additional training
Demonstrates effective modality alignment and emergent capabilities
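To make the zero-shot classification finding concrete, the following is a minimal sketch of how such a prediction is commonly made once 3D and text embeddings live in one shared space: pick the class whose text embedding has the highest cosine similarity to the 3D embedding. The function names and the toy three-dimensional vectors are illustrative assumptions, not taken from the paper; in practice the embeddings would come from the lens plus ViT and from a text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def zero_shot_classify(query_embedding, class_text_embeddings):
    """Return the best-matching label and all cosine scores.

    query_embedding: embedding of the 3D input (hypothetical stand-in).
    class_text_embeddings: dict mapping label -> text embedding (hypothetical).
    """
    query = l2_normalize(query_embedding)
    scores = {}
    for label, text_embedding in class_text_embeddings.items():
        text = l2_normalize(text_embedding)
        scores[label] = sum(q * t for q, t in zip(query, text))
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy data: made-up vectors, used only to exercise the function.
toy_text_embeddings = {
    "chair": [1.0, 0.0, 0.0],
    "mug": [0.0, 1.0, 0.0],
    "plane": [0.0, 0.0, 1.0],
}
toy_3d_embedding = [0.1, 0.9, 0.2]

label, scores = zero_shot_classify(toy_3d_embedding, toy_text_embeddings)
print(label)  # prints "mug"
```

No class-specific training is involved in this step, which is why the approach is called zero-shot: adding a new class only requires a new text embedding.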
Abstract
Despite the success of CLIP-based training recipes in vision-language models, their scalability to more modalities (e.g., 3D, audio, etc.) is limited by the need for large-scale data, which is expensive or even inapplicable for rare modalities. In this paper, we present ViT-Lens that facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning to a pre-defined space. Specifically, the modality-specific lens is tuned to project multimodal signals to the shared embedding space, which are then processed by a strong ViT that carries pre-trained image knowledge. The encoded multimodal representations are optimized toward aligning with the modal-independent space, pre-defined by off-the-shelf foundation models. A well-trained lens with a ViT backbone has the potential to serve as one of these foundation models, supervising the learning of…
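The abstract says the encoded representations are optimized toward a space pre-defined by off-the-shelf foundation models but, as quoted here, does not name the objective. As a hedged illustration only, the sketch below assumes a CLIP-style symmetric contrastive loss between embeddings of the new modality and paired anchor embeddings from a frozen foundation model. The function name, the temperature value of 0.07, and the toy vectors are assumptions; the lens and the ViT themselves are not modeled.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def _mean_diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct column is i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_alignment_loss(modality_embs, anchor_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss (assumed objective, not confirmed by the source).

    modality_embs[i] and anchor_embs[i] are treated as a matched pair;
    every other combination in the batch serves as a negative.
    """
    if len(modality_embs) != len(anchor_embs):
        raise ValueError("batches must contain the same number of items")
    mods = [_normalize(v) for v in modality_embs]
    anchors = [_normalize(v) for v in anchor_embs]
    logits = [
        [sum(m * a for m, a in zip(mod, anc)) / temperature for anc in anchors]
        for mod in mods
    ]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        _mean_diagonal_cross_entropy(logits)
        + _mean_diagonal_cross_entropy(transposed)
    )


# Toy batch: anchors stand in for frozen foundation-model embeddings.
toy_anchors = [[1.0, 0.0], [0.0, 1.0]]
toy_aligned = [[0.9, 0.1], [0.2, 0.8]]   # each item is close to its own anchor
toy_shuffled = [[0.2, 0.8], [0.9, 0.1]]  # each item is close to the wrong anchor

aligned_loss = contrastive_alignment_loss(toy_aligned, toy_anchors)
shuffled_loss = contrastive_alignment_loss(toy_shuffled, toy_anchors)
print(aligned_loss < shuffled_loss)  # prints True
```

In a setup like the one the abstract describes, only the lens parameters would receive gradients from such a loss, while the anchor embeddings stay fixed because the foundation model that defines the target space is used off the shelf.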
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
