DeciWatch: A Simple Baseline for 10x Efficient 2D and 3D Pose Estimation
Ailing Zeng, Xuan Ju, Lei Yang, Ruiyuan Gao, Xizhou Zhu, Bo Dai, Qiang Xu

TL;DR
DeciWatch is a lightweight sparse-sampling framework for video-based 2D/3D human pose estimation. It runs the pose estimator on fewer than 10% of the frames and recovers the rest, giving roughly a 10x efficiency gain without sacrificing accuracy.
Contribution
The paper presents a sample-denoise-recover framework that combines sparse frame sampling with Transformer-based denoising and recovery networks to make pose estimation more efficient.
Findings
Achieves a 10x efficiency improvement over existing methods.
Maintains accuracy comparable to full-frame processing.
Validated on three tasks (video-based human pose estimation and body mesh recovery) across four datasets.
Abstract
This paper proposes DeciWatch, a simple baseline framework for video-based 2D/3D human pose estimation that can achieve a 10x efficiency improvement over existing works without any performance degradation. Unlike current solutions that estimate each frame in a video, DeciWatch introduces a simple yet effective sample-denoise-recover framework that only watches sparsely sampled frames, taking advantage of the continuity of human motions and the lightweight pose representation. Specifically, DeciWatch uniformly samples fewer than 10% of the video frames for detailed estimation, denoises the estimated 2D/3D poses with an efficient Transformer architecture, and then accurately recovers the rest of the frames using another Transformer-based network. Comprehensive experimental results on three video-based human pose estimation and body mesh recovery tasks with four datasets validate the…
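The sample-denoise-recover flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the two Transformer networks are swapped for a moving average (denoising) and linear interpolation (recovery), purely to show how the data moves through the three stages. The function names, the `interval` and `window` parameters, and the choice to always keep the last frame are assumptions made for this illustration.

```python
from typing import List

Pose = List[float]  # flattened joint coordinates for one frame (assumed layout)


def sample_indices(num_frames: int, interval: int) -> List[int]:
    """Uniformly pick every `interval`-th frame.

    Assumption for this sketch: the last frame is always kept so that
    every unsampled frame lies between two sampled ones.
    """
    indices = list(range(0, num_frames, interval))
    if indices[-1] != num_frames - 1:
        indices.append(num_frames - 1)
    return indices


def denoise(poses: List[Pose], window: int = 3) -> List[Pose]:
    """Moving-average smoothing over the sampled poses.

    Stand-in for the denoising Transformer; not the paper's method.
    """
    half = window // 2
    smoothed = []
    for t in range(len(poses)):
        neighbours = poses[max(0, t - half):min(len(poses), t + half + 1)]
        smoothed.append([sum(coord) / len(neighbours) for coord in zip(*neighbours)])
    return smoothed


def recover(sampled: List[Pose], indices: List[int], num_frames: int) -> List[Pose]:
    """Linearly interpolate poses for all frames between sampled frames.

    Stand-in for the recovery Transformer; not the paper's method.
    """
    if len(indices) == 1:
        return [list(sampled[0]) for _ in range(num_frames)]
    full: List[Pose] = [[] for _ in range(num_frames)]
    for k in range(len(indices) - 1):
        i0, i1 = indices[k], indices[k + 1]
        p0, p1 = sampled[k], sampled[k + 1]
        for t in range(i0, i1 + 1):
            w = (t - i0) / (i1 - i0)  # 0 at the left sample, 1 at the right
            full[t] = [(1 - w) * a + w * b for a, b in zip(p0, p1)]
    return full
```

In use, the expensive per-frame pose estimator would run only on the frames returned by `sample_indices`, and the estimated poses would then pass through `denoise` and `recover`. With `interval=10`, the estimator sees roughly one frame in ten, which is where the efficiency gain in this kind of pipeline comes from.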
Taxonomy
Topics: Human Pose and Action Recognition · Human Motion and Animation · Robot Manipulation and Learning
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Softmax · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dense Connections · Residual Connection · Dropout · Layer Normalization
