ViT-Lens: Towards Omni-modal Representations

Weixian Lei; Yixiao Ge; Kun Yi; Jianfeng Zhang; Difei Gao; Dylan Sun; Yuying Ge; Ying Shan; Mike Zheng Shou

arXiv:2311.16081 · cs.CV · March 27, 2024 · 1 citation

TL;DR

ViT-Lens-2 introduces a unified, efficient approach to learn representations across diverse modalities by aligning them with pre-trained vision transformers, enabling zero-shot multimodal understanding and generation.

Contribution

The paper presents ViT-Lens-2, a novel method for omni-modal representation learning that effectively adapts pretrained ViTs to new modalities with minimal data and aligns them for shared understanding.

Findings

01. Achieved state-of-the-art results on various understanding tasks.

02. Enabled zero-shot text and image generation for new modalities.

03. Demonstrated effective learning for 3D, audio, tactile, and EEG data.

Abstract

Aiming to advance AI agents, large foundation models significantly improve reasoning and instruction execution, yet the current focus on vision and language neglects the potential of perceiving diverse modalities in open-world environments. Moreover, the success of data-driven vision and language models is costly or even infeasible to reproduce for rare modalities. In this paper, we present ViT-Lens-2, which facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space. Specifically, a modality-specific lens is tuned to project signals from any modality into an intermediate embedding space, where they are then processed by a strong ViT with pretrained visual knowledge. The encoded representations are optimized toward aligning with the modal-independent space, pre-defined by off-the-shelf foundation models.…
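
The pipeline the abstract describes (a tuned lens, then a frozen pretrained encoder, then alignment to a pre-defined space) can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the linear maps stand in for the lens and the ViT, the anchor vectors stand in for embeddings produced by an off-the-shelf foundation model, and the InfoNCE-style contrastive loss is an assumption, since the abstract only says the representations are "optimized toward aligning". All function and parameter names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def encode(signal, lens_W, frozen_W):
    """Trainable lens followed by a frozen encoder.

    lens_W   -- hypothetical modality-specific lens (the part that is tuned)
    frozen_W -- hypothetical stand-in for the pretrained ViT (kept fixed)
    """
    tokens = matvec(lens_W, signal)    # project into the intermediate space
    feats = matvec(frozen_W, tokens)   # process with pretrained knowledge
    return l2_normalize(feats)


def alignment_loss(embs, anchors, temperature=0.07):
    """Assumed InfoNCE-style loss: embedding i should match anchor i.

    anchors are unit vectors in the pre-defined target space; the
    temperature value is an illustrative choice, not taken from the paper.
    """
    total = 0.0
    for i, e in enumerate(embs):
        logits = [sum(a * b for a, b in zip(e, anc)) / temperature
                  for anc in anchors]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(embs)
```

In a real setup only the lens parameters (and possibly a small part of the encoder) would receive gradient updates, while the anchor space stays fixed; the loss is low when each encoded signal points in the same direction as its paired anchor and high when the pairing is wrong.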

Code & Models

Repositories

TencentARC/ViT-Lens
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Hand Gesture Recognition Systems · Tactile and Sensory Interactions

Methods: Sparse Evolutionary Training · Focus