Spatio-Temporal Attention for Consistent Video Semantic Segmentation in Automated Driving

Serin Varghese; Kevin Ross; Fabian Hueger; Kira Maag

arXiv:2602.10052 · cs.CV · March 23, 2026

Open Access

TL;DR

This paper introduces a Spatio-Temporal Attention (STA) mechanism that lets transformer-based video semantic segmentation use multi-frame context, improving temporal consistency by 9.20 percentage points and mean IoU by up to 1.76 percentage points in dynamic driving scenes.

Contribution

The paper proposes a Spatio-Temporal Attention (STA) module that extends transformer attention blocks to incorporate multi-frame information with minimal architectural changes.

Findings

01. 9.20 percentage-point improvement in temporal consistency

02. Up to 1.76 percentage-point increase in mean IoU

03. Effective across various transformer architectures
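For readers unfamiliar with the accuracy metric in the findings, mean IoU (intersection over union, averaged over classes) can be sketched as follows. This is a minimal illustration on flat label lists; the function name `mean_iou` and the toy labels are assumptions made for this example, and the paper's actual evaluation protocol (for instance, how statistics are accumulated over a dataset) is not described on this page.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over the classes that occur.

    pred, target: equal-length sequences of integer class labels
    (hypothetical flat layout, one label per pixel).
    """
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both pred and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six pixels (made-up labels).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

A gain reported in percentage points is an absolute difference between two such scores expressed in percent, not a relative change.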

Abstract

Deep neural networks, especially transformer-based architectures, have achieved remarkable success in semantic segmentation for environmental perception. However, existing models process video frames independently, thus failing to leverage temporal consistency, which could significantly improve both accuracy and stability in dynamic scenes. In this work, we propose a Spatio-Temporal Attention (STA) mechanism that extends transformer attention blocks to incorporate multi-frame context, enabling robust temporal feature representations for video semantic segmentation. Our approach modifies standard self-attention to process spatio-temporal feature sequences while maintaining computational efficiency and requiring minimal changes to existing architectures. STA demonstrates broad applicability across diverse transformer architectures and remains effective across both lightweight and…
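The idea of running self-attention over a spatio-temporal feature sequence can be sketched in a few lines of NumPy. This is one plausible reading of the abstract, not the authors' implementation: the function name `spatio_temporal_attention`, the choice to take queries from the most recent frame only, the single attention head, the random projection matrices, and the absence of positional encodings are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatio_temporal_attention(frames, w_q, w_k, w_v):
    """Single-head attention over tokens from several frames.

    frames: array of shape (T, N, D) with T frames, N spatial tokens
            per frame, and D channels (hypothetical layout).
    Returns the attended features for the latest frame, shape (N, D_v),
    and the attention weights, shape (N, T * N).
    """
    t, n, d = frames.shape
    tokens = frames.reshape(t * n, d)        # flatten time and space
    q = frames[-1] @ w_q                     # queries: current frame only (assumption)
    k = tokens @ w_k                         # keys: all frames
    v = tokens @ w_v                         # values: all frames
    scores = q @ k.T / np.sqrt(k.shape[-1])  # scaled dot-product
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
T, N, D = 3, 4, 8
frames = rng.standard_normal((T, N, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
out, weights = spatio_temporal_attention(frames, w_q, w_k, w_v)
```

With `T = 1` the sketch reduces to ordinary per-frame self-attention, which illustrates why such an extension can leave the surrounding architecture largely unchanged. The cost of the score matrix grows linearly with the number of frames in this variant, since only the current frame issues queries.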

Taxonomy

Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Human Pose and Action Recognition