Spatio-Temporal Data Enhanced Vision-Language Model for Traffic Scene Understanding

Jingtian Ma; Jingyuan Wang; Wayne Xin Zhao; Guoping Liu; Xiang Wen

arXiv:2511.08978·cs.MM·November 13, 2025

TL;DR

This paper introduces ST-CLIP, a spatio-temporal enhanced vision-language model for traffic scene understanding that combines spatio-temporal data with visual-textual analysis to improve scene comprehension.

Contribution

The paper presents the first integration of spatio-temporal information into vision-language models for traffic scene understanding, using a novel prompt learning approach.

Findings

01 Superior performance on real-world datasets

02 Effective in complex scene understanding scenarios

03 Works well with few-shot learning strategies

Abstract

Nowadays, navigation and ride-sharing apps have collected numerous images with spatio-temporal data. A core technology for analyzing such images, associated with spatio-temporal information, is Traffic Scene Understanding (TSU), which aims to provide a comprehensive description of the traffic scene. Unlike traditional spatio-temporal data analysis tasks, the dependence on both spatio-temporal and visual-textual data introduces distinct challenges to the TSU task. However, recent research often treats TSU as a common image understanding task, ignoring the spatio-temporal information and overlooking the interrelations between different aspects of the traffic scene. To address these issues, we propose a novel Spatio-Temporal Enhanced Model based on CLIP (ST-CLIP) for TSU. Our model uses the classic vision-language model, CLIP, as the backbone, and designs a Spatio-temporal Context Aware…
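As a rough illustration of the general idea, CLIP-style matching scores an image embedding against one text embedding per candidate label, and a prompt can be conditioned on context by shifting those text embeddings. The sketch below is not the paper's method: the hour-of-day feature, the linear `context_shift`, the toy 3-dimensional embeddings, and every function name are assumptions made up for illustration, and no real CLIP encoder is involved.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def time_features(hour):
    """Cyclic encoding of the hour of day (hypothetical context feature)."""
    angle = 2 * math.pi * hour / 24
    return [math.sin(angle), math.cos(angle)]

def context_shift(features, weights):
    """Linear map from context features to a shift in embedding space.

    `weights` stands in for learnable prompt parameters (assumption).
    """
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def classify(image_emb, label_embs, features, weights, temperature=0.07):
    """Score an image against context-shifted label embeddings."""
    shift = context_shift(features, weights)
    scores = []
    for emb in label_embs.values():
        conditioned = [e + s for e, s in zip(emb, shift)]
        scores.append(cosine(image_emb, conditioned) / temperature)
    return dict(zip(label_embs.keys(), softmax(scores)))

# Toy inputs; real embeddings would come from image and text encoders.
image_emb = [1.0, 0.0, 0.0]
label_embs = {
    "congested": [0.9, 0.1, 0.0],
    "free-flowing": [0.1, 0.9, 0.0],
}
weights = [[0.1, 0.0], [0.0, 0.1], [0.0, 0.0]]  # 3 x 2, hypothetical values

probs = classify(image_emb, label_embs, time_features(8), weights)
print(probs)
```

In a trained model the entries of `weights` would be learned from labeled scenes, so that the same label text can match images differently depending on when and where they were taken.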

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications