TL;DR
This paper introduces a deep learning architecture for multi-person tracking that jointly detects visible and occluded body parts, leveraging a large synthetic dataset to improve accuracy in urban scene tracking.
Contribution
The authors propose a novel end-to-end network with four branches for joint detection and association of body parts, trained on the largest synthetic dataset for urban human tracking.
Findings
Model effectively detects occluded and visible joints.
Architecture generalizes well to real-world benchmarks.
Synthetic data enables robust tracking in urban scenarios.
Abstract
Multi-people tracking in an open-world setting requires particular effort in precise detection. Moreover, temporal continuity in the detection phase becomes more important when scene clutter introduces the challenging problem of occluded targets. For this purpose, we propose a deep network architecture that jointly extracts people's body parts and associates them across short temporal spans. Our model explicitly deals with occluded body parts by hallucinating plausible solutions for joints that are not visible. We propose a new end-to-end architecture composed of four branches (visible heatmaps, occluded heatmaps, part affinity fields and temporal affinity fields) fed by a time linker feature extractor. To overcome the lack of surveillance data with tracking, body part and occlusion annotations, we created the largest Computer Graphics dataset for people tracking in urban scenarios by exploiting a…
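Illustrative sketch
The abstract names separate visible and occluded heatmap branches but does not say how joint candidates are read out of them. The Python sketch below shows one plausible way to do it, under assumptions that are not taken from the paper: each branch yields one 2D confidence map per joint type, candidates are thresholded local maxima, and the occlusion flag comes from which branch produced the peak. The names Joint, local_peaks and collect_joints, and the 0.5 threshold, are hypothetical. The part affinity and temporal affinity branches, which would link these candidates into skeletons and short tracks, are left out.

```python
from dataclasses import dataclass
from typing import List, Tuple

Heatmap = List[List[float]]  # one 2D confidence map, indexed [row][col]


@dataclass
class Joint:
    joint_type: int   # index of the heatmap channel (hypothetical joint ordering)
    row: int
    col: int
    score: float
    occluded: bool    # True if the peak came from the occluded-heatmap branch


def local_peaks(heatmap: Heatmap, threshold: float) -> List[Tuple[int, int, float]]:
    """Return (row, col, score) for cells at or above the threshold that are
    not smaller than any of their 8 neighbours (ties are all kept)."""
    peaks = []
    rows = len(heatmap)
    cols = len(heatmap[0]) if rows else 0
    for r in range(rows):
        for c in range(cols):
            value = heatmap[r][c]
            if value < threshold:
                continue
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(value >= n for n in neighbours):
                peaks.append((r, c, value))
    return peaks


def collect_joints(visible: List[Heatmap],
                   occluded: List[Heatmap],
                   threshold: float = 0.5) -> List[Joint]:
    """Merge peaks from both branches into one candidate list.
    Assumption: channel i in both lists refers to the same joint type."""
    joints = []
    for is_occluded, maps in ((False, visible), (True, occluded)):
        for joint_type, heatmap in enumerate(maps):
            for r, c, score in local_peaks(heatmap, threshold):
                joints.append(Joint(joint_type, r, c, score, is_occluded))
    return joints
```

Keeping the occlusion flag on each candidate lets a later association step treat hallucinated joints differently from observed ones, for example by weighting them less. Whether the authors do this is not stated in the abstract.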
