Tracking People with 3D Representations
Jathushan Rajasegaran, Georgios Pavlakos, Angjoo Kanazawa, Jitendra Malik

TL;DR
This paper tracks multiple people in video using 3D representations of each person (mesh geometry, texture-map appearance, and 3D location) and reports better tracking accuracy than approaches built on 2D representations.
Contribution
The paper presents Human Mesh and Appearance Recovery (HMAR), a method that extracts each person's 3D geometry as a SMPL mesh and their appearance as a texture map on that mesh, and uses these representations to obtain state-of-the-art tracking results.
Findings
3D representations outperform 2D representations in tracking accuracy
State-of-the-art performance on the PoseTrack, MuPoTS, and AVA datasets
Robustness to viewpoint and pose variations
Abstract
We present a novel approach for tracking multiple people in video. Unlike past approaches which employ 2D representations, we focus on using 3D representations of people, located in three-dimensional space. To this end, we develop a method, Human Mesh and Appearance Recovery (HMAR), which, in addition to extracting the 3D geometry of the person as a SMPL mesh, also extracts appearance as a texture map on the triangles of the mesh. This serves as a 3D representation for appearance that is robust to viewpoint and pose changes. Given a video clip, we first detect bounding boxes corresponding to people, and for each one, we extract 3D appearance, pose, and location information using HMAR. These embedding vectors are then sent to a transformer, which performs spatio-temporal aggregation of the representations over the duration of the sequence. The similarity of the resulting representations is…
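The abstract is cut off before it says how the similarity between representations is used, so the following is only an illustrative sketch, not the authors' procedure. It assumes each tracked person and each new detection has already been reduced to a single embedding vector (standing in for the aggregated appearance, pose, and location representations), and it swaps in a plain greedy matching on cosine similarity. The function names, the toy vectors, and the min_similarity threshold are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def associate(track_embeddings, detection_embeddings, min_similarity=0.5):
    """Greedily match detections to tracks in order of descending similarity.

    Hypothetical helper: returns {detection_index: track_index}. Detections
    whose best available similarity falls below min_similarity stay unmatched
    (a tracker would typically start a new track for them).
    """
    # Score every (track, detection) pair.
    pairs = []
    for t, t_emb in enumerate(track_embeddings):
        for d, d_emb in enumerate(detection_embeddings):
            pairs.append((cosine_similarity(t_emb, d_emb), t, d))
    pairs.sort(key=lambda p: p[0], reverse=True)

    matches = {}
    used_tracks = set()
    for sim, t, d in pairs:
        if sim < min_similarity:
            break  # remaining pairs are all below the threshold
        if t in used_tracks or d in matches:
            continue  # each track and each detection is used at most once
        matches[d] = t
        used_tracks.add(t)
    return matches


# Toy example: two existing tracks, three new detections.
tracks = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
detections = [[0.0, 0.9, 0.1], [0.95, 0.05, 0.0], [0.0, 0.0, 1.0]]
matches = associate(tracks, detections)
```

In this toy case detection 0 is assigned to track 1 and detection 1 to track 0, while detection 2 resembles neither track and is left unmatched. Greedy matching is used here only to keep the sketch dependency-free; an optimal one-to-one assignment solver would be a natural alternative.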
Taxonomy
Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Face recognition and analysis
