Hard Cases Detection in Motion Prediction by Vision-Language Foundation Models

Yi Yang; Qingwen Zhang; Kei Ikemura; Nazre Batool; John Folkesson

arXiv:2405.20991·cs.CV·June 3, 2024

TL;DR

This paper explores using vision-language foundation models (VLMs) such as GPT-4v to detect hard cases in autonomous driving. Identifying challenging agents and scenarios for motion prediction is intended to improve both safety and training efficiency.

Contribution

The paper introduces a pipeline that uses VLMs for hard case detection in autonomous driving, with the goal of supporting data selection and improving model robustness.

Findings

01 VLMs effectively identify challenging traffic scenarios.

02 The pipeline improves training efficiency for motion prediction models.

03 The approach is demonstrated on the nuScenes dataset with state-of-the-art methods.

Abstract

Addressing hard cases in autonomous driving, such as anomalous road users, extreme weather conditions, and complex traffic interactions, presents significant challenges. To ensure safety, autonomous driving systems must detect and manage these scenarios effectively. However, the rarity and high-risk nature of these cases demand extensive, diverse datasets for training robust models. Vision-Language Foundation Models (VLMs) have shown remarkable zero-shot capabilities owing to their training on extensive datasets. This work explores the potential of VLMs in detecting hard cases in autonomous driving. We demonstrate the capability of VLMs such as GPT-4v in detecting hard cases in traffic participant motion prediction at both the agent and scenario levels. We introduce a feasible pipeline where VLMs, fed with sequential image frames with designed prompts, effectively identify…

Code & Models

Repositories

kth-rpl/detect_vlm (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques

Methods: Attention Is All You Need · Dropout · Dense Connections · Softmax · Layer Normalization · Cosine Annealing · Discriminative Fine-Tuning · Attention Dropout · Linear Layer