TL;DR
This paper introduces a multi-view, self-supervised framework for human detection and segmentation that enforces geometric consistency across cameras during training and operates on single RGB images at inference. It outperforms state-of-the-art techniques both on images that visually depart from standard benchmarks and on Human3.6M.
Contribution
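It presents a multi-camera training framework for self-supervised human detection and segmentation in which geometric constraints are embedded as multi-view consistency, via coarse 3D localization in a voxel grid and fine-grained offset regression. The framework targets scenes with dynamic activities and camera motion.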
Findings
Outperforms state-of-the-art techniques both on images that visually depart from standard benchmarks and on the classical Human3.6M dataset.
Operates effectively on single RGB images at inference.
Utilizes multi-view geometric constraints during training.
Abstract
Self-supervised detection and segmentation of foreground objects aims for accuracy without annotated training data. However, existing approaches predominantly rely on restrictive assumptions on appearance and motion. For scenes with dynamic activities and camera motion, we propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training via coarse 3D localization in a voxel grid and fine-grained offset regression. In this manner, we learn a joint distribution of proposals over multiple views. At inference time, our method operates on single RGB images. We outperform state-of-the-art techniques both on images that visually depart from those of standard benchmarks and on those of the classical Human3.6M dataset.
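The abstract does not give implementation details, so the following is only a minimal sketch of how a coarse voxel location plus a fine-grained offset could yield geometrically consistent proposals in several views. It assumes calibrated pinhole cameras supplied as 3x4 projection matrices; the function names, the grid parameters, and the use of a single 3D point per proposal are all hypothetical and are not taken from the paper.

```python
import numpy as np

def voxel_center(index, grid_origin, voxel_size):
    """World-space center of the voxel at integer index (i, j, k).
    grid_origin is the corner of voxel (0, 0, 0); assumed cubic voxels."""
    return np.asarray(grid_origin, dtype=float) + (np.asarray(index, dtype=float) + 0.5) * voxel_size

def project(P, point_3d):
    """Project a 3D world point with a 3x4 camera matrix P to pixel (u, v)."""
    homog = P @ np.append(point_3d, 1.0)
    return homog[:2] / homog[2]

def proposals_in_views(index, offset, grid_origin, voxel_size, cameras):
    """Coarse voxel location refined by a regressed 3D offset (hypothetical),
    then projected into every camera so all views share one 3D hypothesis."""
    point = voxel_center(index, grid_origin, voxel_size) + np.asarray(offset, dtype=float)
    return [project(P, point) for P in cameras]
```

Because every 2D proposal comes from the same 3D point, the views agree by construction, which is one way to read "multi-view consistency". At inference the paper's method uses a single RGB image, so a sketch like this would apply to the training stage only.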
