STER-VLM: Spatio-Temporal With Enhanced Reference Vision-Language Models

Tinh-Anh Nguyen-Nhu; Triet Dao Hoang Minh; Dat To-Thanh; Phuc Le-Gia; Tuan Vo-Lan; Tien-Huy Nguyen

arXiv:2508.13470·cs.CV·August 20, 2025


TL;DR

STER-VLM is a resource-efficient vision-language framework that improves fine-grained spatio-temporal understanding for traffic analysis by combining caption decomposition, temporal frame selection, reference-driven understanding, and curated prompt techniques.

Contribution

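It introduces a computationally efficient framework that improves VLM-based traffic scene understanding through four techniques: caption decomposition that handles spatial and temporal information separately, temporal frame selection with best-view filtering, reference-driven understanding of fine-grained motion and dynamic context, and curated visual and textual prompts.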

Findings

01. Achieved a test score of 55.655 in the AI City Challenge 2025 Track 2.

02. Demonstrated substantial improvements in semantic richness and scene interpretation.

03. Validated effectiveness on the WTS and BDD datasets.

Abstract

Vision-language models (VLMs) have emerged as powerful tools for automated traffic analysis; however, current approaches often demand substantial computational resources and struggle with fine-grained spatio-temporal understanding. This paper introduces STER-VLM, a computationally efficient framework that enhances VLM performance through (1) caption decomposition to tackle spatial and temporal information separately, (2) temporal frame selection with best-view filtering to retain sufficient temporal information, (3) reference-driven understanding to capture fine-grained motion and dynamic context, and (4) curated visual/textual prompt techniques. Experimental results on the WTS \cite{kong2024wts} and BDD \cite{BDD} datasets demonstrate substantial gains in semantic richness and traffic scene interpretation. Our framework is validated through a decent test score of 55.655 in the…
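The abstract names temporal frame selection with best-view filtering but does not spell out the selection criteria. The sketch below is only an illustration of the general idea, not the authors' method: it assumes each camera view supplies per-frame bounding-box areas for the target, treats the view with the largest mean box area as the "best" view, and samples frames evenly across the visible portion of the segment. The names `Frame`, `pick_best_view`, and `select_frames`, and the box-area heuristic itself, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Frame:
    index: int       # frame number within the video
    box_area: float  # area of the target's bounding box; 0.0 if not visible


def pick_best_view(views: Dict[str, List[Frame]]) -> str:
    """Return the view name whose target has the largest mean box area.

    Assumption: a larger box means a clearer view of the target.
    """
    def mean_area(frames: List[Frame]) -> float:
        return sum(f.box_area for f in frames) / len(frames) if frames else 0.0

    return max(views, key=lambda name: mean_area(views[name]))


def select_frames(frames: List[Frame], k: int) -> List[Frame]:
    """Pick up to k frames in which the target is visible, spread evenly in time."""
    if k <= 0:
        return []
    visible = [f for f in frames if f.box_area > 0.0]
    if len(visible) <= k:
        return visible
    if k == 1:
        return [visible[len(visible) // 2]]
    # Evenly spaced positions from the first to the last visible frame.
    step = (len(visible) - 1) / (k - 1)
    return [visible[round(i * step)] for i in range(k)]


# Toy data: two views of the same ten-frame segment (values are made up).
views = {
    "overhead": [Frame(i, 100.0) for i in range(10)],
    "vehicle": [Frame(i, 400.0 if i % 2 == 0 else 0.0) for i in range(10)],
}
best = pick_best_view(views)
chosen = select_frames(views[best], k=3)
```

Even sampling keeps the start, middle, and end of a segment represented while limiting the number of frames passed to the VLM, which is one plausible way to keep inference cost low; the paper itself should be consulted for the actual criteria.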
