Detect-and-Track: Efficient Pose Estimation in Videos
Rohit Girdhar, Georgia Gkioxari, Lorenzo Torresani, Manohar Paluri, and Du Tran

TL;DR
This paper introduces a lightweight, two-stage method for human pose estimation and tracking in videos, combining frame-based keypoint detection with temporal tracking to improve accuracy and efficiency.
Contribution
The paper presents a two-stage approach that pairs frame-level pose estimation with lightweight tracking, and proposes a 3D extension of Mask R-CNN that uses temporal information over short clips to make frame predictions more robust.
Findings
Achieves 55.2% MOTA on PoseTrack validation set
State-of-the-art performance on ICCV 2017 PoseTrack challenge
Effective use of temporal information for improved pose estimation
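For readers unfamiliar with the metric, the sketch below computes MOTA using the standard CLEAR MOT definition, MOTA = 1 - (FN + FP + IDSW) / GT. It is an illustration only: the function name and the example counts are hypothetical, and the PoseTrack benchmark's exact evaluation protocol is not reproduced here.

```python
def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_ground_truth: int) -> float:
    """Multi-Object Tracking Accuracy under the standard CLEAR MOT definition.

    All counts are summed over every frame of the sequence.
    The result is at most 1.0 and can be negative when errors exceed
    the number of ground-truth objects.
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_ground_truth


# Hypothetical counts, chosen only to show the arithmetic:
# (30 + 10 + 5) / 100 = 0.45 error rate.
example_score = mota(false_negatives=30, false_positives=10,
                     id_switches=5, num_ground_truth=100)
```

Because identity switches count as errors, a method with accurate per-frame poses can still score poorly if it links people inconsistently across frames.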
Abstract
This paper addresses the problem of estimating and tracking human body keypoints in complex, multi-person video. We propose an extremely lightweight yet highly effective approach that builds upon the latest advancements in human detection and video understanding. Our method operates in two stages: keypoint estimation in frames or short clips, followed by lightweight tracking to generate keypoint predictions linked over the entire video. For frame-level pose estimation we experiment with Mask R-CNN, as well as our own proposed 3D extension of this model, which leverages temporal information over small clips to generate more robust frame predictions. We conduct extensive ablative experiments on the newly released multi-person video pose estimation benchmark, PoseTrack, to validate various design choices of our model. Our approach achieves an accuracy of 55.2% on the validation and 51.8%…
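The second stage described in the abstract, linking per-frame detections into tracks, can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes each detected person is represented by a bounding box, uses box IoU as the similarity, and uses a simple greedy matcher between consecutive frames. The names `iou`, `link_frames`, and the `min_iou` threshold are hypothetical choices made for this example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_frames(frames: List[List[Box]], min_iou: float = 0.3) -> List[List[int]]:
    """Assign a track id to every box, frame by frame.

    Each box is greedily matched to the most-overlapping unmatched box
    of the previous frame; boxes with no match above min_iou (an assumed
    threshold) start a new track.
    """
    next_id = 0
    prev: List[Tuple[Box, int]] = []  # (box, track id) from the previous frame
    all_ids: List[List[int]] = []
    for boxes in frames:
        ids = [-1] * len(boxes)
        # Score every (previous box, current box) pair, best overlap first.
        pairs = sorted(
            ((iou(pb, b), pi, ci)
             for pi, (pb, _) in enumerate(prev)
             for ci, b in enumerate(boxes)),
            reverse=True,
        )
        used_prev, used_cur = set(), set()
        for score, pi, ci in pairs:
            if score < min_iou:
                break  # remaining pairs overlap too little to link
            if pi in used_prev or ci in used_cur:
                continue
            ids[ci] = prev[pi][1]
            used_prev.add(pi)
            used_cur.add(ci)
        # Unmatched detections start new tracks.
        for ci in range(len(boxes)):
            if ids[ci] == -1:
                ids[ci] = next_id
                next_id += 1
        prev = list(zip(boxes, ids))
        all_ids.append(ids)
    return all_ids


# Two people who swap list order in the second frame; one leaves in the third.
example_frames = [
    [(0, 0, 10, 10), (20, 20, 30, 30)],
    [(21, 21, 31, 31), (1, 1, 11, 11)],
    [(2, 2, 12, 12)],
]
example_ids = link_frames(example_frames)
```

The linker only compares adjacent frames and stores no appearance model, which is what keeps this kind of tracking stage cheap; the trade-off is that a person who is missed in one frame receives a new track id when detected again.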
Taxonomy
Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Advanced Vision and Imaging
Methods: Region Proposal Network · Softmax · RoIAlign · Convolution · Mask R-CNN
