STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving

Christian Fruhwirth-Reisinger; Dušan Malić; Wei Lin; David Schinagl; Samuel Schulter; Horst Possegger

arXiv:2506.06218 · cs.CV · June 9, 2025

TL;DR

STSBench is a comprehensive benchmark framework that evaluates vision-language models' spatio-temporal reasoning in autonomous driving scenarios, revealing critical gaps and guiding future model improvements.

Contribution

Introduces STSBench, a novel scenario-based benchmark for assessing spatio-temporal reasoning of vision-language models in autonomous driving.

Findings

01 Existing models struggle with traffic dynamics reasoning.

02 The benchmark reveals significant shortcomings in current VLMs.

03 The results highlight the need for architectures that explicitly model spatio-temporal reasoning.

Abstract

We introduce STSBench, a scenario-based framework to benchmark the holistic understanding of vision-language models (VLMs) for autonomous driving. The framework automatically mines pre-defined traffic scenarios from any dataset using ground-truth annotations, provides an intuitive user interface for efficient human verification, and generates multiple-choice questions for model evaluation. Applied to the NuScenes dataset, we present STSnu, the first benchmark that evaluates the spatio-temporal reasoning capabilities of VLMs based on comprehensive 3D perception. Existing benchmarks typically target off-the-shelf or fine-tuned VLMs for images or videos from a single viewpoint and focus on semantic tasks such as object recognition, dense captioning, risk assessment, or scene understanding. In contrast, STSnu evaluates driving expert VLMs for end-to-end driving, operating on videos from…
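The mine-then-ask idea described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Track` class, the four scenario labels, the speed thresholds, and the question template are hypothetical and are not taken from STSBench or its repository. The human verification step is omitted.

```python
import random
from dataclasses import dataclass

# Hypothetical scenario labels for this sketch, not the benchmark's catalogue.
SCENARIOS = [
    "accelerating",
    "decelerating",
    "stationary",
    "moving at constant speed",
]


@dataclass
class Track:
    """Ground-truth (x, y) positions in metres of one object at a fixed rate."""
    object_id: str
    positions: list[tuple[float, float]]
    dt: float = 0.5  # seconds between samples (assumed value)


def speeds(track: Track) -> list[float]:
    """Finite-difference speed in m/s between consecutive samples."""
    if len(track.positions) < 2:
        raise ValueError("need at least two positions to estimate speed")
    pairs = zip(track.positions, track.positions[1:])
    return [
        ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 / track.dt
        for (x0, y0), (x1, y1) in pairs
    ]


def mine_scenario(track: Track, still: float = 0.2, delta: float = 1.0) -> str:
    """Label a track with one scenario using simple, assumed thresholds."""
    v = speeds(track)
    if max(v) < still:
        return "stationary"
    change = v[-1] - v[0]  # speed difference between end and start of clip
    if change > delta:
        return "accelerating"
    if change < -delta:
        return "decelerating"
    return "moving at constant speed"


def make_question(track: Track, seed: int = 0) -> dict:
    """Turn a mined scenario into a multiple-choice question."""
    answer = mine_scenario(track)
    options = SCENARIOS[:]
    random.Random(seed).shuffle(options)  # seeded so the order is reproducible
    letters = "ABCD"
    return {
        "question": f"What is object {track.object_id} doing over the clip?",
        "options": dict(zip(letters, options)),
        "answer": letters[options.index(answer)],
    }


if __name__ == "__main__":
    car = Track("car_1", [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (6.0, 0.0)])
    q = make_question(car)
    print(q["question"])
    for letter, text in q["options"].items():
        print(f"  {letter}. {text}")
    print("correct:", q["answer"])
```

Using the remaining scenario labels as distractors keeps the options mutually exclusive, so exactly one answer is correct for any track. A real pipeline would work from 3D annotations and a much richer scenario set.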

Code & Models

Repositories

lrp-ivc/stsbench
none · Official

Videos

STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving · slideslive

Taxonomy

Topics: Multimodal Machine Learning Applications · Autonomous Vehicle Technology and Safety · Advanced Neural Network Applications

Methods: Focus