Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives

Sheng Luo; Wei Chen; Wanxin Tian; Rui Liu; Luanxuan Hou; Xiubao Zhang; Haifeng Shen; Ruiqi Wu; Shuyi Geng; Yi Zhou; Ling Shao; Yi Yang; Bojun Gao; Qun Li; Guobin Wu

arXiv:2402.02968 · cs.CV · May 28, 2024 · 1 citation

TL;DR

This survey analyzes multi-modal multi-task foundation models for road scene understanding, emphasizing their learning paradigms, capabilities, challenges, and future directions in intelligent vehicle systems.

Contribution

It provides a comprehensive overview of MM-VUFMs for road scenes, highlighting recent practices, paradigms, and future research trends in the field.

Findings

1. MM-VUFMs effectively fuse multi-modal data for holistic scene understanding.
2. Advanced learning paradigms enhance adaptability and robustness of models.
3. Key challenges include interpretability and closed-loop system integration.

Abstract

Foundation models have indeed made a profound impact on various fields, emerging as pivotal components that significantly shape the capabilities of intelligent systems. In the context of intelligent vehicles, leveraging the power of foundation models has proven to be transformative, offering notable advancements in visual understanding. Equipped with multi-modal and multi-task learning capabilities, multi-modal multi-task visual understanding foundation models (MM-VUFMs) effectively process and fuse data from diverse modalities and simultaneously handle various driving-related tasks with powerful adaptability, contributing to a more holistic understanding of the surrounding scene. In this survey, we present a systematic analysis of MM-VUFMs specifically designed for road scenes. Our objective is not only to provide a comprehensive overview of common practices, referring to task-specific…

Code & Models

Repositories

rolsheng/mm-vufm4ds (Official)

Taxonomy

Topics: Infrastructure Maintenance and Monitoring · Natural Language Processing Techniques