Toward Unified Multimodal Representation Learning for Autonomous Driving

Ximeng Tao; Dimitar Filev; Gaurav Pandey

arXiv:2603.07874·cs.CV·March 10, 2026

Open Access · 1 model · 1 dataset

TL;DR

This paper introduces a unified multimodal pre-training framework for autonomous driving that aligns visual, textual, and 3D data in a shared space, improving scene understanding beyond pairwise methods.

Contribution

It proposes a Contrastive Tensor Pre-training (CTP) method that jointly aligns multiple modalities using a tensor loss, extending beyond traditional pairwise similarity approaches.
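
To make the step from a similarity matrix to a similarity tensor concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the function names are hypothetical, and the trilinear product used to score a triplet is an assumption chosen for illustration, since this summary does not define how the tensor entries are computed.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def pairwise_similarity(a, b):
    """N x N cosine-similarity matrix between two modalities (the pairwise case)."""
    return l2_normalize(a) @ l2_normalize(b).T


def similarity_tensor(img, txt, pts):
    """N x N x N tensor whose entry (i, j, k) scores image i, text j, and
    point cloud k together.

    ASSUMPTION: a trilinear product of unit-norm embeddings is used here as
    one possible joint score; the paper's actual definition may differ.
    """
    img, txt, pts = (l2_normalize(m) for m in (img, txt, pts))
    return np.einsum("id,jd,kd->ijk", img, txt, pts)


# Toy batch: 4 samples per modality, 8-dimensional embeddings (random stand-ins).
rng = np.random.default_rng(0)
n, d = 4, 8
img = rng.normal(size=(n, d))
txt = rng.normal(size=(n, d))
pts = rng.normal(size=(n, d))

S2 = pairwise_similarity(img, txt)     # shape (n, n)
S3 = similarity_tensor(img, txt, pts)  # shape (n, n, n)
```

In this sketch the matched triplets sit on the main diagonal `S3[i, i, i]`, in the same way that matched pairs sit on the diagonal of `S2`.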

Findings

1. Achieves better multimodal alignment than pairwise methods in autonomous driving scenarios.

2. Is effective both with pretrained encoders and with encoders trained from scratch.

3. Constructs a new triplet dataset for validation.

Abstract

Contrastive Language-Image Pre-training (CLIP) has shown impressive performance in aligning visual and textual representations. Recent studies have extended this paradigm to 3D vision to improve scene understanding for autonomous driving. A common strategy is to employ pairwise cosine similarity between modalities to guide the training of a 3D encoder. However, considering the similarity between individual modality pairs rather than all modalities jointly fails to ensure consistent and unified alignment across the entire multimodal space. In this paper, we propose a Contrastive Tensor Pre-training (CTP) framework that simultaneously aligns multiple modalities in a unified embedding space to enhance end-to-end autonomous driving. Compared with pairwise cosine similarity alignment, our method extends the 2D similarity matrix into a multimodal similarity tensor. Furthermore, we introduce a…
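
For reference, the pairwise strategy that the abstract contrasts against can be sketched as a CLIP-style symmetric contrastive loss applied to each modality pair separately. This is an illustrative sketch, not code from the paper: the function names are hypothetical, the temperature of 0.07 is an assumed value, and summing the three pairwise losses is only one plausible way such baselines are combined.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(z, axis):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def pairwise_contrastive_loss(a, b, temperature=0.07):
    """Symmetric cross-entropy over an N x N cosine-similarity matrix.

    Row i of `a` and row i of `b` are treated as the matching pair.
    ASSUMPTION: temperature=0.07 is an illustrative choice.
    """
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b retrieval
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a retrieval
    return 0.5 * (loss_ab + loss_ba)


def sum_of_pairwise_losses(img, txt, pts):
    """Hypothetical pairwise baseline: each modality pair is aligned on its own,
    so no single term ever scores an (image, text, point cloud) triplet jointly."""
    return (
        pairwise_contrastive_loss(img, txt)
        + pairwise_contrastive_loss(img, pts)
        + pairwise_contrastive_loss(txt, pts)
    )


# Toy check: perfectly matched embeddings versus deliberately mismatched ones.
aligned = np.eye(4)
shuffled = np.roll(aligned, 1, axis=0)
loss_aligned = pairwise_contrastive_loss(aligned, aligned)
loss_shuffled = pairwise_contrastive_loss(aligned, shuffled)
```

The toy check gives a near-zero loss for matched rows and a much larger loss for mismatched rows. Because each term only sees two modalities at a time, minimizing the sum does not directly constrain all three jointly, which is the gap the abstract points to.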

Code & Models

Models

Ximeng0831/CTP · model

Datasets

Ximeng0831/CTP-Dataset · dataset · 35 downloads

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Neural Network Applications