Efficient Spatial-Temporal Modeling for Real-Time Video Analysis: A Unified Framework for Action Recognition and Object Tracking

Shahla John

arXiv:2507.22421·cs.CV·July 31, 2025

TL;DR

This paper introduces a unified spatial-temporal framework for real-time video analysis that improves accuracy and speed in action recognition and object tracking through hierarchical attention mechanisms.

Contribution

The work presents a novel hierarchical attention-based model for real-time spatial-temporal analysis that handles action recognition and object tracking in a single framework.

Findings

01. Achieves state-of-the-art accuracy on UCF-101 and HMDB-51.

02. Improves tracking precision on the MOT17 dataset.

03. Reduces inference time by 40% compared to previous methods.

Abstract

Real-time video analysis remains a challenging problem in computer vision, requiring joint processing of spatial and temporal information at low computational cost. Existing approaches often struggle to balance accuracy and speed, particularly in resource-constrained environments. In this work, we present a unified framework that leverages advanced spatial-temporal modeling techniques for simultaneous action recognition and object tracking. Our approach builds upon recent advances in parallel sequence modeling and introduces a novel hierarchical attention mechanism that adaptively focuses on relevant spatial regions across temporal sequences. We demonstrate that our method achieves state-of-the-art performance on standard benchmarks while maintaining real-time inference speeds. Extensive experiments on the UCF-101, HMDB-51, and MOT17 datasets show improvements of…
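The abstract does not spell out how the hierarchical attention mechanism is built, so the following is only an illustrative sketch of one common reading of the idea: attention first pools region features within each frame, then pools the resulting frame summaries across time. The function names, the fixed query vectors, and the toy features are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(vectors, query):
    """Softmax-weighted average of `vectors`, scored by scaled dot product with `query`."""
    scale = math.sqrt(len(query))
    scores = [sum(x * q for x, q in zip(v, query)) / scale for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
    return pooled, weights


def hierarchical_attention(video, spatial_query, temporal_query):
    """Two-level pooling (an assumed design, not the paper's).

    video: list of frames; each frame is a list of region feature vectors.
    Level 1 attends over regions inside each frame.
    Level 2 attends over the per-frame summaries.
    """
    frame_summaries = []
    spatial_weights = []
    for regions in video:
        summary, weights = attend(regions, spatial_query)
        frame_summaries.append(summary)
        spatial_weights.append(weights)
    clip_vector, temporal_weights = attend(frame_summaries, temporal_query)
    return clip_vector, spatial_weights, temporal_weights


# Toy input: 2 frames, 2 regions per frame, 2-dimensional features.
video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 0.5]],
]
clip, s_w, t_w = hierarchical_attention(video, [1.0, 0.0], [1.0, 0.0])
```

In this toy run, the first region of each frame aligns with the spatial query and so receives the larger spatial weight, and the second frame receives the larger temporal weight. A real model would learn the queries and use multi-head attention over feature maps; the sketch only shows the two-level structure.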
