Snipper: A Spatiotemporal Transformer for Simultaneous Multi-Person 3D Pose Estimation Tracking and Forecasting on a Video Snippet

Shihao Zou; Yuanlu Xu; Chao Li; Lingni Ma; Li Cheng; Minh Vo

arXiv:2207.04320·cs.CV·September 14, 2023

Open Access · 1 Repo

TL;DR

Snipper is a unified spatiotemporal transformer framework that simultaneously performs multi-person 3D pose estimation, tracking, and motion forecasting from video snippets, leveraging deformable attention for efficient information aggregation.

Contribution

The paper introduces Snipper, a novel single-stage transformer model that jointly addresses pose estimation, tracking, and forecasting, unlike prior methods that treat these tasks separately.

Findings

01. Rivals state-of-the-art methods in pose estimation, tracking, and forecasting.

02. Encodes spatiotemporal features effectively with deformable attention.

03. Achieves competitive results on three public datasets.

Abstract

Multi-person pose understanding from RGB videos involves three complex tasks: pose estimation, tracking, and motion forecasting. Intuitively, accurate multi-person pose estimation facilitates robust tracking, and robust tracking builds the history needed for correct motion forecasting. Most existing works either focus on a single task or employ multi-stage approaches that solve the tasks separately; such approaches tend to make sub-optimal decisions at each stage and fail to exploit correlations among the three tasks. In this paper, we propose Snipper, a unified framework that performs multi-person 3D pose estimation, tracking, and motion forecasting simultaneously in a single stage. We propose an efficient yet powerful deformable attention mechanism to aggregate spatiotemporal information from the video snippet. Building upon this deformable attention, a video transformer is learned to encode…
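The deformable attention idea mentioned in the abstract can be illustrated with a small sketch. This is not Snipper's implementation: it is a minimal pure-Python illustration of the general technique, in which a query samples a few fractional locations around a reference point in every frame of a snippet and combines the samples with softmax weights. The function names (`bilinear_sample`, `softmax`, `deformable_attention`) are hypothetical. The sampling offsets and attention logits are passed in by hand here, whereas a trained model would typically predict them from the query with learned layers.

```python
import math


def bilinear_sample(frame, y, x):
    """Bilinearly interpolate a grid of feature vectors at fractional (y, x).

    frame[y][x] is a list of floats; taps outside the grid contribute zero.
    """
    h, w, dim = len(frame), len(frame[0]), len(frame[0][0])
    y0, x0 = math.floor(y), math.floor(x)
    out = [0.0] * dim
    for yy, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                for c in range(dim):
                    out[c] += wy * wx * frame[yy][xx][c]
    return out


def softmax(scores):
    """Numerically stable softmax over a flat list of logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_attention(snippet, ref, offsets, logits):
    """Aggregate features from a few sampled points per frame.

    snippet[t][y][x]: feature vector at frame t, row y, column x.
    ref:              (y, x) reference point of the query.
    offsets[t]:       list of (dy, dx) sampling offsets for frame t
                      (assumed given; normally predicted from the query).
    logits[t]:        one attention logit per offset in frame t
                      (assumed given; normally predicted from the query).
    """
    # Normalise weights jointly over all frames and sampling points.
    weights = softmax([s for frame_logits in logits for s in frame_logits])
    dim = len(snippet[0][0][0])
    out = [0.0] * dim
    i = 0
    for t, frame in enumerate(snippet):
        for dy, dx in offsets[t]:
            sample = bilinear_sample(frame, ref[0] + dy, ref[1] + dx)
            for c in range(dim):
                out[c] += weights[i] * sample[c]
            i += 1
    return out
```

Because each query touches only a handful of sampled points per frame rather than every spatiotemporal location, the cost grows with the number of sampling points instead of the full snippet size, which is the usual efficiency argument for this family of attention mechanisms.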

Code & Models

Repositories

jimmyzou/snipper
PyTorch · Official

Taxonomy

Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Video Analysis and Summarization