GSVR: 2D Gaussian-based Video Representation for 800+ FPS with Hybrid Deformation Field

Zhizhuo Pang; Zhihui Ke; Xiaobo Zhou; Tie Qiu

arXiv:2507.05594·cs.CV·July 9, 2025

TL;DR

GSVR is a 2D Gaussian-based video representation that decodes at over 800 FPS while reaching 35+ PSNR on Bunny, and it cuts both training and decoding time relative to existing convolution-based neural video representations.

Contribution

The paper proposes a novel 2D Gaussian-based video representation with hybrid deformation fields and adaptive slicing, enabling rapid training and decoding while maintaining high quality.

Findings

01. Achieves 800+ FPS decoding with 35+ PSNR on the Bunny dataset.

02. Decodes 10x faster than existing methods.

03. Requires only 2 seconds of training per frame.
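
To put these figures in perspective, the short calculation below converts them into a per-frame decode budget and a training speedup. It is only arithmetic on numbers quoted on this page (the 14 seconds per frame baseline comes from the abstract); the function names and the 100-frame clip length are hypothetical and chosen purely for illustration.

```python
# Figures quoted on this page; names are illustrative, not from the paper.
DECODE_FPS = 800.0              # "800+ FPS" (treated as a lower bound)
GSVR_TRAIN_S_PER_FRAME = 2.0    # "2 seconds per frame"
CONV_TRAIN_S_PER_FRAME = 14.0   # "about 14 seconds per frame" (abstract)


def per_frame_decode_ms(fps: float) -> float:
    """Time available to decode one frame at the given rate, in milliseconds."""
    return 1000.0 / fps


def training_speedup(baseline_s: float, new_s: float) -> float:
    """How many times faster training is, per frame, than the baseline."""
    return baseline_s / new_s


def total_training_minutes(num_frames: int, s_per_frame: float) -> float:
    """Total training time for a clip, assuming cost scales linearly with frames."""
    return num_frames * s_per_frame / 60.0


decode_ms = per_frame_decode_ms(DECODE_FPS)
speedup = training_speedup(CONV_TRAIN_S_PER_FRAME, GSVR_TRAIN_S_PER_FRAME)

# Hypothetical 100-frame clip, assumed only for the sake of the example.
gsvr_minutes = total_training_minutes(100, GSVR_TRAIN_S_PER_FRAME)
conv_minutes = total_training_minutes(100, CONV_TRAIN_S_PER_FRAME)
```

At 800 FPS each frame must be decoded in at most 1.25 ms, and 2 seconds versus 14 seconds per frame is a 7x reduction in training time. The linear scaling in `total_training_minutes` is an assumption; the page only reports per-frame averages.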

Abstract

Implicit neural representations for video have been recognized as a novel and promising form of video representation. Existing works pay more attention to improving video reconstruction quality but little attention to decoding speed, and the high computational cost of the convolutional networks used in existing methods leads to low decoding speed. Moreover, these convolution-based video representation methods also suffer from long training times, about 14 seconds per frame to achieve 35+ PSNR on Bunny. To solve these problems, we propose GSVR, a novel 2D Gaussian-based video representation, which achieves 800+ FPS and 35+ PSNR on Bunny while needing only 2 seconds of training per frame. Specifically, we propose a hybrid deformation field to model the dynamics of the video, which combines two motion patterns, namely the tri-plane motion and the polynomial motion, to deal with the…
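
The abstract is cut off before it defines either motion pattern, so the following is only a rough sketch of what a polynomial motion term could look like, not the authors' formulation. It assumes that each 2D Gaussian has a base center and that its displacement at normalized time t is a low-degree polynomial in t with per-Gaussian coefficients. The isotropic Gaussian, the function names, and all numeric values are hypothetical, and the tri-plane component is not modeled at all.

```python
import math


def poly_offset(coeffs, t):
    """Evaluate sum_k coeffs[k] * t**(k + 1) with Horner's rule.

    There is no constant term, so the offset is zero at t = 0.
    """
    acc = 0.0
    for c in reversed(coeffs):
        acc = (acc + c) * t
    return acc


def center_at(mu0, coeffs_x, coeffs_y, t):
    """Center of one Gaussian at time t: base position plus polynomial offset."""
    return (mu0[0] + poly_offset(coeffs_x, t),
            mu0[1] + poly_offset(coeffs_y, t))


def gaussian_weight(pixel, center, sigma):
    """Unnormalized isotropic 2D Gaussian falloff (a simplifying assumption)."""
    dx = pixel[0] - center[0]
    dy = pixel[1] - center[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))


# Hypothetical Gaussian: base center (10, 20), degree-2 motion in x and y.
mu0 = (10.0, 20.0)
coeffs_x = [4.0, -2.0]   # offset_x(t) = 4t - 2t^2
coeffs_y = [0.0, 1.0]    # offset_y(t) = t^2

center_mid = center_at(mu0, coeffs_x, coeffs_y, 0.5)
weight_at_center = gaussian_weight(center_mid, center_mid, sigma=3.0)
```

Here `center_at(mu0, coeffs_x, coeffs_y, 0.0)` returns the base center unchanged, and at t = 0.5 the center has moved to (11.5, 20.25). A closed-form motion model of this kind needs only a few multiply-adds per Gaussian per frame, which is one plausible reason such a term would suit fast decoding, though the page does not say so.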

Taxonomy

Topics: Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings