$\text{PKS}^4$: Parallel Kinematic Selective State Space Scanners for Efficient Video Understanding

Lingjie Zeng; Hailun Zhang; Xiwen Wang; Qijun Zhao

arXiv:2604.26461 · cs.CV · April 30, 2026

TL;DR

PKS$^4$ is a linear-complexity, parallel kinematic state space scanning method that improves efficiency and preserves spatiotemporal relationships in video understanding tasks.

Contribution

The paper proposes PKS$^4$, a plug-and-play module that augments a standard 2D vision backbone with parallel, kinematic-prior-driven state space models for efficient, high-performance video analysis.

Findings

01 Achieves state-of-the-art results on action recognition benchmarks.

02 Converges in only 20 epochs with 10x less training compute.

03 Maintains spatial structure while reducing computational overhead.

Abstract

Temporal modeling remains a fundamental challenge in video understanding, particularly as sequence lengths scale. Traditional video models relying on dense spatiotemporal attention suffer from quadratic computational costs for long videos. To circumvent these costs, recent approaches adapt image models for videos via Parameter-Efficient Fine-Tuning (PEFT) methods such as adapters. However, deeply inserting these modules incurs prohibitive activation memory overhead during back-propagation. While recent efficient State Space Models (SSMs) introduce linear complexity, they disrupt 2D spatial relationships and rely on extensive masked pre-training to recover spatial awareness. To overcome these limitations, we propose Parallel Kinematic Selective State Space Scanners (PKS$^4$). We retain a standard 2D vision backbone for spatial semantics and insert a single plug-and-play PKS$^4$ module…
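
The complexity contrast drawn in the abstract can be illustrated with a minimal sketch of a generic selective state space recurrence. This is not the PKS$^4$ scanner itself, whose details are not given here. It assumes a diagonal state transition with per-step parameters, so each token updates the state as $h_t = a_t \odot h_{t-1} + b_t x_t$ and emits $y_t = c_t \cdot h_t$. The names `selective_scan`, `attention_pair_count`, and `scan_step_count`, and the parameters `a`, `b`, `c`, are hypothetical and chosen only for illustration.

```python
def selective_scan(x, a, b, c):
    """Run a diagonal state space recurrence over a 1D token sequence.

    x: list of L scalar inputs
    a, b, c: lists of L per-step parameter vectors, each of length N
             (per-step values are what makes the scan "selective")
    Returns a list of L scalar outputs. Work grows linearly with L.
    """
    state = [0.0] * len(a[0])
    outputs = []
    for x_t, a_t, b_t, c_t in zip(x, a, b, c):
        # h_t = a_t * h_{t-1} + b_t * x_t, elementwise over the N state dims
        state = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a_t, state, b_t)]
        # y_t = <c_t, h_t>
        outputs.append(sum(c_i * h_i for c_i, h_i in zip(c_t, state)))
    return outputs


def attention_pair_count(num_tokens):
    """Token pairs scored by dense attention: quadratic in sequence length."""
    return num_tokens * num_tokens


def scan_step_count(num_tokens):
    """Recurrence steps taken by the scan: linear in sequence length."""
    return num_tokens


if __name__ == "__main__":
    # Toy sequence: constant decay 0.5, unit input and output projections.
    y = selective_scan([1.0, 1.0, 1.0], [[0.5]] * 3, [[1.0]] * 3, [[1.0]] * 3)
    print(y)

    # Assumed example sizes: 196 patches per frame, 8 versus 16 frames.
    for frames in (8, 16):
        tokens = frames * 196
        print(frames, attention_pair_count(tokens), scan_step_count(tokens))
```

Under these assumed sizes, doubling the number of frames quadruples the number of attention pairs but only doubles the number of scan steps, which is the scaling behaviour that motivates SSM-based temporal modeling for long videos.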
