Scenario Understanding of Traffic Scenes Through Large Visual Language Models
Esteban Rivera, Jannik Lübberstedt, Nico Uhlemann, Markus Lienkamp

TL;DR
This paper evaluates large visual language models like GPT-4 and LLaVA for understanding and classifying urban traffic scenes, aiming to improve autonomous driving systems' generalization and reduce manual annotation efforts.
Contribution
It introduces a scalable captioning pipeline using LVLMs for urban traffic scene understanding and demonstrates their effectiveness on multiple datasets.
Findings
LVLMs effectively classify traffic scenes with high accuracy.
The proposed pipeline enables flexible deployment on new datasets.
LVLMs reduce the need for manual annotation in scene categorization.
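One way to picture how contextual queries could yield scene categories without manual annotation is the minimal Python sketch below. It is not the authors' pipeline: the category scheme in `CATEGORIES`, the helper names (`build_prompt`, `parse_answer`, `caption_image`), and the `query_model` callable are all assumptions made for illustration. In practice `query_model` would wrap a call to an LVLM such as GPT-4 or LLaVA; here a stand-in function returns canned answers so the sketch runs on its own.

```python
from typing import Callable, Dict, List

# Hypothetical category scheme; this summary does not list the paper's actual categories.
CATEGORIES: Dict[str, List[str]] = {
    "weather": ["clear", "rainy", "snowy", "foggy"],
    "time_of_day": ["day", "night", "dawn_dusk"],
    "road_type": ["city_street", "highway", "residential"],
}


def build_prompt(attribute: str, options: List[str]) -> str:
    """Build one closed-set question for a single scene attribute."""
    readable = attribute.replace("_", " ")
    return (
        f"Look at the traffic scene and classify its {readable}. "
        f"Answer with exactly one of: {', '.join(options)}."
    )


def parse_answer(answer: str, options: List[str]) -> str:
    """Map a free-text model answer onto one allowed option, else 'unknown'."""
    text = answer.strip().lower()
    # Check longer options first so short ones do not shadow them.
    for option in sorted(options, key=len, reverse=True):
        if option in text or option.replace("_", " ") in text:
            return option
    return "unknown"


def caption_image(
    image_path: str, query_model: Callable[[str, str], str]
) -> Dict[str, str]:
    """Ask one question per attribute and collect the parsed labels."""
    labels: Dict[str, str] = {}
    for attribute, options in CATEGORIES.items():
        prompt = build_prompt(attribute, options)
        labels[attribute] = parse_answer(query_model(image_path, prompt), options)
    return labels


def fake_model(image_path: str, prompt: str) -> str:
    """Stand-in for a real LVLM call (assumption: returns plain text)."""
    if "weather" in prompt:
        return "The weather looks Rainy."
    if "time of day" in prompt:
        return "night"
    return "I cannot tell."


labels = caption_image("scene.jpg", fake_model)
print(labels)
```

Adding a new category in this sketch only means adding an entry to `CATEGORIES`, which mirrors the point that such queries often need no retraining. Answers that match no allowed option fall back to `unknown` rather than being forced into a class.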
Abstract
Deep learning models for autonomous driving, encompassing perception, planning, and control, depend on vast datasets to achieve their high performance. However, their generalization often suffers due to domain-specific data distributions, making an effective scene-based categorization of samples necessary to improve their reliability across diverse domains. Manual captioning, though valuable, is both labor-intensive and time-consuming, creating a bottleneck in the data annotation process. Large Visual Language Models (LVLMs) present a compelling solution by automating image analysis and categorization through contextual queries, often without requiring retraining for new categories. In this study, we evaluate the capabilities of LVLMs, including GPT-4 and LLaVA, to understand and classify urban traffic scenes on both an in-house dataset and the BDD100K. We propose a scalable captioning…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition
Methods: Attention Is All You Need · Softmax · Adam · Residual Connection · Dropout · Absolute Position Encodings · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer
