TL;DR
This paper introduces TS-Mamba, an online video super-resolution method that models long-term trajectories and aggregates spatio-temporal information efficiently, achieving state-of-the-art results with reduced computational complexity.
Contribution
The paper proposes a novel trajectory-aware shifted state space model (TS-Mamba) that enhances long-term temporal modeling and efficiency in online video super-resolution.
Findings
Achieves state-of-the-art performance on three VSR datasets.
Reduces computational complexity by over 22.7% in MACs.
Effectively models long-term trajectories for improved super-resolution.
Abstract
Online video super-resolution (VSR) is an important technique for many real-world video processing applications, which aims to restore the current high-resolution video frame based on temporally previous frames. Most of the existing online VSR methods solely employ one neighboring previous frame to achieve temporal alignment, which limits long-range temporal modeling of videos. Recently, state space models (SSMs) have been proposed with linear computational complexity and a global receptive field, which significantly improve computational efficiency and performance. In this context, this paper presents a novel online VSR method based on Trajectory-aware Shifted SSMs (TS-Mamba), leveraging both long-term trajectory modeling and low-complexity Mamba to achieve efficient spatio-temporal information aggregation. Specifically, TS-Mamba first constructs the trajectories within a video to…
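As a rough illustration of the linear-complexity claim made for SSMs above, the sketch below runs a generic discrete linear state space recurrence over a sequence in a single pass. It is not TS-Mamba and not the authors' code: the function name `ssm_scan` and the toy parameters `a`, `b`, `c` are hypothetical, and the parameters are fixed here, whereas Mamba makes them input-dependent (selective).

```python
def ssm_scan(xs, a, b, c):
    """Run a scalar-input linear state space recurrence with a diagonal state matrix.

    h_t = a * h_{t-1} + b * x_t   (elementwise over the state dimension)
    y_t = sum(c * h_t)
    """
    h = [0.0] * len(a)  # hidden state, one entry per state dimension
    ys = []
    for x in xs:  # one pass over the sequence: cost grows linearly with len(xs)
        h = [ai * hi + bi * x for ai, hi, bi in zip(a, h, b)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys


# Hypothetical toy parameters, chosen only for illustration.
a = [0.5, 0.0]  # per-dimension decay: the first dimension remembers, the second forgets
b = [1.0, 1.0]  # input projection
c = [1.0, 1.0]  # output projection

ys = ssm_scan([1.0, 0.0, 0.0], a, b, c)
print(ys)  # [2.0, 0.5, 0.25]
```

The first state dimension keeps a decaying memory of the initial input, so every output depends on the whole history at a constant per-step cost. That is the sense in which such models combine a global receptive field with linear complexity.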
Peer Reviews
Decision·ICLR 2026 Poster
1. Performance: The model achieves superior performance in terms of PSNR/SSIM and visual quality across multiple benchmark datasets (REDS, Vid4, Vimeo-90K-T) and degradation types (BI and BD), demonstrating its robustness in real-world video restoration scenarios. 2. Computational Efficiency: TS-Mamba successfully reduces complexity by 22.7% in terms of MACs compared to existing methods, making it a strong candidate for real-time online VSR applications. The model is also one of the fastest amo
1. Lack of Comparison. The proposed scheme fails to discuss or compare with several recent restoration schemes that leverage state-space models (SSMs, e.g. Mamba) for superior performance. For instance, MambaIR and MambaIRv2 introduced a residual Mamba-based backbone (with convolution and channel attention) to capture global dependencies in image super-resolution and denoising, outperforming a SwinIR Transformer baseline. The more recent TAMambaIR improves efficiency by modulating the state-spac
1. The paper comprehensively discusses related work in online video super-resolution and points out the limitations of existing methods. 2. TS-Mamba solely utilizes past frames for online video super-resolution and aggregates long-range information through trajectory-aware token selection, which follows motion paths across multiple previous frames. 3. The proposed method is efficient without sacrificing quality, leading to a better trade-off between reconstruction quality and processin
1. What is $v_{\tau_{i}^{h_{j}}}$? Are they math typos in Eq.(7) and Eq.(8)? 2. The method proposes a trajectory-aware approach and defines a temporal trajectory among video frames in Eq.(3). However, I am confused about this definition. It would be better to elaborate on which positions among video frames belong to the same trajectory. 3. The method introduces a dual-path block, i.e., an intra-window compensation branch and an inter-window compensation branch. What is the key difference between the two branches? W
The integration of Mamba-based spatial modeling with trajectory-aware design is conceptually clear and implemented in a well-structured manner. The proposed Trajectory-aware Shifted Mamba Aggregation (TSMA) module is technically sound and elegantly engineered. The model achieves reasonable complexity-efficiency trade-offs: it incorporates long-term temporal modeling while maintaining relatively low MACs and parameter counts compared to prior online VSR baselines.
1. Biased motivation. The paper claims that “most existing VSR methods solely employ one neighboring previous frame”, yet this statement overlooks numerous recent works that already explore long-range temporal modeling. For instance: [1] Wang, Xijun, et al. LiftVSR: Lifting Image Diffusion to Video Super-Resolution via Hybrid Temporal Modeling with Only 4×RTX 4090s. arXiv:2506.08529 (2025). [2] Liu, Yong, et al. UltraVSR: Achieving Ultra-Realistic Video Super-Resolution with Efficient One-Ste
