DualX-VSR: Dual Axial Spatial$\times$Temporal Transformer for Real-World Video Super-Resolution without Motion Compensation

Shuo Cao; Yihao Liu; Xiaohui Li; Yuanting Gao; Yu Zhou; Chao Dong

arXiv:2506.04830·cs.CV·June 16, 2025

TL;DR

DualX-VSR introduces a dual axial attention mechanism that models spatiotemporal information without motion compensation, improving real-world video super-resolution (VSR) performance.

Contribution

The paper proposes a dual axial spatial×temporal transformer that eliminates the need for motion compensation in real-world VSR.

Findings

01. Achieves high fidelity in real-world VSR tasks.

02. Outperforms existing transformer-based VSR models.

03. Provides a simplified, motion-compensation-free architecture.

Abstract

Transformer-based models like ViViT and TimeSformer have advanced video understanding by effectively modeling spatiotemporal dependencies. Recent video generation models, such as Sora and Vidu, further highlight the power of transformers in long-range feature extraction and holistic spatiotemporal modeling. However, directly applying these models to real-world video super-resolution (VSR) is challenging, as VSR demands pixel-level precision, which can be compromised by tokenization and sequential attention mechanisms. While recent transformer-based VSR models attempt to address these issues using smaller patches and local attention, they still face limitations such as restricted receptive fields and dependence on optical flow-based alignment, which can introduce inaccuracies in real-world settings. To overcome these issues, we propose Dual Axial Spatial $\times$ Temporal Transformer for…
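To make the idea of attending along axes (rather than over small local windows) concrete, here is a minimal NumPy sketch of generic axial attention over a video tensor. It is an illustration under stated assumptions, not the paper's implementation: the function names, the identity query/key/value projections, the choice of the temporal-width and temporal-height planes as the two axes, and the residual sum used to combine the two branches are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Single-head self-attention with identity projections (an assumption
    # made for brevity). tokens has shape (..., N, C).
    scale = 1.0 / np.sqrt(tokens.shape[-1])
    scores = (tokens @ np.swapaxes(tokens, -1, -2)) * scale
    return softmax(scores, axis=-1) @ tokens

def axial_attention_tw(video):
    # video: (T, H, W, C). For each row h, attend over every (t, w) position,
    # so each token sees the full width across all frames.
    T, H, W, C = video.shape
    x = video.transpose(1, 0, 2, 3).reshape(H, T * W, C)
    y = self_attention(x)
    return y.reshape(H, T, W, C).transpose(1, 0, 2, 3)

def axial_attention_th(video):
    # video: (T, H, W, C). For each column w, attend over every (t, h) position.
    T, H, W, C = video.shape
    x = video.transpose(2, 0, 1, 3).reshape(W, T * H, C)
    y = self_attention(x)
    return y.reshape(W, T, H, C).transpose(1, 2, 0, 3)

def dual_axial_block(video):
    # Hypothetical combination: residual sum of the two axial branches.
    return video + axial_attention_tw(video) + axial_attention_th(video)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.standard_normal((3, 4, 5, 8))  # (T, H, W, C)
    print(dual_axial_block(clip).shape)
```

In this sketch no frame is warped or aligned before attention; temporal information enters only because the time axis is part of each attended plane. Each branch costs attention over T×W or T×H tokens per row or column, rather than over all T×H×W tokens at once.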

Taxonomy

Topics: Advanced Image Processing Techniques · Advanced Vision and Imaging · Image and Video Quality Assessment

Methods: Layer Normalization · Dropout · Absolute Position Encodings · Dense Connections · Byte Pair Encoding · Softmax · Label Smoothing · Transformer · TimeSformer