MTA: Multimodal Task Alignment for BEV Perception and Captioning

Yunsheng Ma; Burhaneddin Yaman; Xin Ye; Jingru Luo; Feng Tao; Abhirup Mallik; Ziran Wang; Liu Ren

arXiv:2411.10639 · cs.CV · March 12, 2025

TL;DR

This paper introduces MTA, a framework that aligns BEV perception and captioning tasks to improve autonomous driving scene understanding without extra runtime costs.

Contribution

MTA proposes a novel multimodal alignment framework that enhances both BEV perception and captioning by integrating alignment mechanisms during training.

Findings

01 Significant performance improvements on the nuScenes and TOD3Cap datasets.

02 10.7% improvement in rare perception scenarios.

03 9.2% improvement in captioning accuracy.

Abstract

Bird's eye view (BEV)-based 3D perception plays a crucial role in autonomous driving applications. The rise of large language models has spurred interest in BEV-based captioning to understand object behavior in the surrounding environment. However, existing approaches treat perception and captioning as separate tasks, focusing on the performance of only one task and overlooking the potential benefits of multimodal alignment. To bridge this gap between modalities, we introduce MTA, a novel multimodal task alignment framework that boosts both BEV perception and captioning. MTA consists of two key components: (1) BEV-Language Alignment (BLA), a contextual learning mechanism that aligns the BEV scene representations with ground-truth language representations, and (2) Detection-Captioning Alignment (DCA), a cross-modal prompting mechanism that aligns detection and captioning outputs. MTA…
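The idea of a training-only alignment term can be illustrated with a minimal sketch. It assumes, purely for illustration, that alignment is measured as cosine distance between paired per-object BEV features and language features; the abstract does not state the actual loss, and the names `alignment_loss`, `training_loss`, and `align_weight` are hypothetical. Because the term only enters the training objective, nothing extra would run at inference.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def alignment_loss(
    bev_feats: Sequence[Sequence[float]],
    text_feats: Sequence[Sequence[float]],
) -> float:
    """Mean cosine distance over paired (BEV, language) feature vectors.

    Assumption: one language feature per BEV object feature, same dimension.
    This is an illustrative choice, not the loss used in the paper.
    """
    if len(bev_feats) != len(text_feats) or len(bev_feats) == 0:
        raise ValueError("need the same non-zero number of BEV and text features")
    distances = [1.0 - cosine_similarity(b, t) for b, t in zip(bev_feats, text_feats)]
    return sum(distances) / len(distances)


def training_loss(
    detection_loss: float,
    captioning_loss: float,
    bev_feats: Sequence[Sequence[float]],
    text_feats: Sequence[Sequence[float]],
    align_weight: float = 1.0,  # hypothetical weighting hyperparameter
) -> float:
    """Task losses plus a weighted alignment term, used during training only."""
    return (
        detection_loss
        + captioning_loss
        + align_weight * alignment_loss(bev_feats, text_feats)
    )
```

In this sketch, perfectly aligned feature pairs contribute zero to the objective and orthogonal pairs contribute the full weight, so the term pulls the two representations toward the same direction without changing the model's inference path.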

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Natural Language Processing Techniques