Toward Unified Multimodal Representation Learning for Autonomous Driving
Ximeng Tao, Dimitar Filev, Gaurav Pandey

TL;DR
This paper introduces a unified multimodal pre-training framework for autonomous driving that aligns visual, textual, and 3D data in a shared space, improving scene understanding beyond pairwise methods.
Contribution
It proposes a Contrastive Tensor Pre-training (CTP) method that jointly aligns multiple modalities using a tensor loss, extending beyond traditional pairwise similarity approaches.
Findings
Achieves better multimodal alignment in autonomous driving scenarios than pairwise methods.
Works with both pretrained encoders and encoders trained from scratch.
Introduces a newly constructed triplet dataset for validation.
Abstract
Contrastive Language-Image Pre-training (CLIP) has shown impressive performance in aligning visual and textual representations. Recent studies have extended this paradigm to 3D vision to improve scene understanding for autonomous driving. A common strategy is to employ pairwise cosine similarity between modalities to guide the training of a 3D encoder. However, considering the similarity between individual modality pairs rather than all modalities jointly fails to ensure consistent and unified alignment across the entire multimodal space. In this paper, we propose a Contrastive Tensor Pre-training (CTP) framework that simultaneously aligns multiple modalities in a unified embedding space to enhance end-to-end autonomous driving. Compared with pairwise cosine similarity alignment, our method extends the 2D similarity matrix into a multimodal similarity tensor. Furthermore, we introduce a…
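To make the step from a 2D similarity matrix to a multimodal similarity tensor concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how CTP scores a triplet or how its tensor loss is defined. The trilinear score in `similarity_tensor`, the single-anchor cross-entropy in `diagonal_contrastive_loss`, the temperature value, and all function names are assumptions made purely for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def pairwise_similarity(a, b):
    """CLIP-style 2D matrix: entry (i, j) is the cosine similarity of a[i] and b[j]."""
    return l2_normalize(a) @ l2_normalize(b).T


def similarity_tensor(img, txt, pts):
    """ASSUMED three-way score: S[i, j, k] = sum_d img[i, d] * txt[j, d] * pts[k, d]."""
    return np.einsum("id,jd,kd->ijk",
                     l2_normalize(img), l2_normalize(txt), l2_normalize(pts))


def diagonal_contrastive_loss(sim, temperature=0.07):
    """Cross-entropy that treats the matched tuple (i, i, ..., i) as the positive.

    Works for a 2D matrix or a higher-order tensor with equal-sized axes.
    Only the first axis serves as the anchor, which is a simplification
    chosen for this sketch and not taken from the paper.
    """
    n = sim.shape[0]
    logits = (sim / temperature).reshape(n, -1)
    # Flattened position of (i, ..., i) within row i: i * (1 + n + n^2 + ...).
    stride = sum(n ** p for p in range(sim.ndim - 1))
    target = np.arange(n) * stride
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(n), target].mean())


# Hypothetical batch: 4 samples, 8-dimensional embeddings per modality.
rng = np.random.default_rng(0)
n, d = 4, 8
img, txt, pts = (rng.normal(size=(n, d)) for _ in range(3))

print(pairwise_similarity(img, txt).shape)      # (4, 4)
print(similarity_tensor(img, txt, pts).shape)   # (4, 4, 4)
print(diagonal_contrastive_loss(similarity_tensor(img, txt, pts)))
```

In the pairwise case each modality pair gets its own N x N matrix and its own loss, so nothing forces the three pairings to agree. In the tensor case every (image, text, point cloud) combination in the batch receives a single score, and only the fully matched triplets on the main diagonal count as positives, which is one way to read the abstract's "unified alignment".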
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Neural Network Applications
