ViT-Lens: Towards Omni-modal Representations
Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, Mike Zheng Shou

TL;DR
ViT-Lens-2 is a unified, efficient approach to learning representations across diverse modalities: a pretrained vision transformer (ViT) encodes each new modality, and the resulting representations are aligned to a shared, pre-defined space, enabling zero-shot multimodal understanding and generation.
Contribution
The paper presents ViT-Lens-2, a method for omni-modal representation learning that adapts a pretrained ViT to new modalities with minimal data and aligns their representations to a shared, pre-defined embedding space.
Findings
Achieved state-of-the-art results on various understanding tasks.
Enabled zero-shot text and image generation for new modalities.
Demonstrated effective learning for 3D, audio, tactile, and EEG data.
Abstract
Aiming to advance AI agents, large foundation models significantly improve reasoning and instruction execution, yet the current focus on vision and language neglects the potential of perceiving diverse modalities in open-world environments. However, the success of data-driven vision and language models is costly or even infeasible to be reproduced for rare modalities. In this paper, we present ViT-Lens-2 that facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space. Specifically, the modality-specific lens is tuned to project any-modal signals to an intermediate embedding space, which are then processed by a strong ViT with pre-trained visual knowledge. The encoded representations are optimized toward aligning with the modal-independent space, pre-defined by off-the-shelf foundation models.…
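The pipeline described in the abstract (a tuned modality-specific lens, a pretrained ViT, and alignment to a pre-defined space) can be illustrated roughly as follows. This is a minimal sketch, not the authors' implementation. The abstract does not specify the lens architecture or the training objective, so the single linear map in `Lens`, the fixed random `FrozenEncoder` standing in for a pretrained ViT, and the InfoNCE-style `alignment_loss` are all assumptions chosen for illustration.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0 / math.sqrt(cols)) for _ in range(cols)]
            for _ in range(rows)]


class Lens:
    """Hypothetical modality-specific lens: the only part that would be tuned.

    Assumption: a single linear projection into the encoder's input space.
    """

    def __init__(self, in_dim, embed_dim, rng):
        self.weight = random_matrix(embed_dim, in_dim, rng)

    def __call__(self, signal):
        return matvec(self.weight, signal)


class FrozenEncoder:
    """Stand-in for the pretrained ViT: its weights are fixed after creation."""

    def __init__(self, embed_dim, out_dim, rng):
        self.weight = random_matrix(out_dim, embed_dim, rng)

    def __call__(self, embedded):
        hidden = [math.tanh(h) for h in matvec(self.weight, embedded)]
        return l2_normalize(hidden)


def alignment_loss(encoded, anchors, temperature=0.07):
    """InfoNCE-style loss (an assumption): sample i should match anchor i."""
    total = 0.0
    for i, z in enumerate(encoded):
        logits = [sum(a * b for a, b in zip(z, anchor)) / temperature
                  for anchor in anchors]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum - logits[i]
    return total / len(encoded)


rng = random.Random(0)
lens = Lens(in_dim=16, embed_dim=8, rng=rng)
vit = FrozenEncoder(embed_dim=8, out_dim=4, rng=rng)

# Toy "new modality" signals and the anchor embeddings they should align to.
# In the paper's setting the anchors would come from an off-the-shelf
# foundation model; here they are hand-picked orthogonal unit vectors.
signals = [[rng.uniform(-1.0, 1.0) for _ in range(16)] for _ in range(3)]
anchors = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

encoded = [vit(lens(s)) for s in signals]
loss = alignment_loss(encoded, anchors)
print(f"alignment loss before any tuning: {loss:.3f}")
```

In a training loop, only the weights of `Lens` would be updated to reduce `loss`, while `FrozenEncoder` and the anchor embeddings stay fixed. That split is what would let a new modality reuse the visual knowledge already stored in the pretrained encoder.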
Taxonomy
Topics: Multimodal Machine Learning Applications · Hand Gesture Recognition Systems · Tactile and Sensory Interactions
Methods: Sparse Evolutionary Training · Focus
