Scenario Understanding of Traffic Scenes Through Large Visual Language Models

Esteban Rivera; Jannik Lübberstedt; Nico Uhlemann; Markus Lienkamp

arXiv:2501.17131 · cs.CV · April 8, 2025

TL;DR

This paper evaluates large visual language models like GPT-4 and LLaVA for understanding and classifying urban traffic scenes, aiming to improve autonomous driving systems' generalization and reduce manual annotation efforts.

Contribution

It introduces a scalable captioning pipeline using LVLMs for urban traffic scene understanding and demonstrates their effectiveness on multiple datasets.

Findings

01

LVLMs effectively classify traffic scenes with high accuracy.

02

The proposed pipeline enables flexible deployment on new datasets.

03

LVLMs reduce the need for manual annotation in scene categorization.

Abstract

Deep learning models for autonomous driving, encompassing perception, planning, and control, depend on vast datasets to achieve their high performance. However, their generalization often suffers due to domain-specific data distributions, making an effective scene-based categorization of samples necessary to improve their reliability across diverse domains. Manual captioning, though valuable, is both labor-intensive and time-consuming, creating a bottleneck in the data annotation process. Large Visual Language Models (LVLMs) present a compelling solution by automating image analysis and categorization through contextual queries, often without requiring retraining for new categories. In this study, we evaluate the capabilities of LVLMs, including GPT-4 and LLaVA, to understand and classify urban traffic scenes on both an in-house dataset and the BDD100K. We propose a scalable captioning…
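The idea of categorizing scenes through contextual queries can be illustrated with a minimal sketch. It is not the authors' pipeline: the attribute schema, the prompt wording, and the `ask_model` callable are all assumptions made for illustration, and the model is replaced by a stub so the example runs without any LVLM backend.

```python
from typing import Callable, Dict, List

# Hypothetical attribute schema; the paper's actual categories are not listed here.
CATEGORIES: Dict[str, List[str]] = {
    "weather": ["clear", "rainy", "snowy", "foggy"],
    "time_of_day": ["day", "night", "dawn_or_dusk"],
}


def build_prompt(attribute: str, options: List[str]) -> str:
    """Build a closed-set question so the free-text answer is easy to parse."""
    readable = attribute.replace("_", " ")
    return (
        f"Look at the traffic scene. What is the {readable}? "
        f"Answer with exactly one of: {', '.join(options)}."
    )


def parse_answer(raw: str, options: List[str]) -> str:
    """Map a free-text model reply onto one of the allowed options."""
    text = raw.strip().lower()
    # Check longer options first so a short option cannot shadow a longer one.
    for option in sorted(options, key=len, reverse=True):
        if option in text:
            return option
    return "unknown"


def caption_image(
    image_path: str, ask_model: Callable[[str, str], str]
) -> Dict[str, str]:
    """Query the model once per attribute and collect the parsed labels.

    `ask_model(image_path, prompt)` is a placeholder for whatever LVLM
    backend is used (assumption); it must return the reply as a string.
    """
    labels: Dict[str, str] = {}
    for attribute, options in CATEGORIES.items():
        raw = ask_model(image_path, build_prompt(attribute, options))
        labels[attribute] = parse_answer(raw, options)
    return labels


def fake_model(image_path: str, prompt: str) -> str:
    """Stub standing in for a real LVLM; returns canned replies."""
    return "Rainy." if "weather" in prompt else "It looks like night."


if __name__ == "__main__":
    print(caption_image("frame_0001.jpg", fake_model))
```

Because the categories live in a plain dictionary, adding a new attribute only changes the prompt, which mirrors the abstract's point that new categories often need no retraining.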

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition

Methods: Attention Is All You Need · Softmax · Adam · Residual Connection · Dropout · Absolute Position Encodings · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer