Egocentric Scene Understanding via Multimodal Spatial Rectifier
Tien Do, Khiem Vuong, Hyun Soo Park

TL;DR
This paper introduces a multimodal spatial rectifier and a new dataset, EDINA, to improve egocentric scene understanding, specifically depth and surface normal prediction, addressing challenges from non-canonical viewpoints and dynamic foreground objects.
Contribution
The paper proposes a multimodal spatial rectifier for egocentric images and introduces the EDINA dataset, enabling better learning of dynamic scene representations and significantly improving depth and normal estimation.
Findings
Outperforms baseline models on EDINA, FPHA, and EPIC-KITCHENS datasets.
Effectively stabilizes egocentric images from non-canonical viewpoints.
Enhances depth and surface normal prediction accuracy.
Abstract
In this paper, we study the problem of egocentric scene understanding, i.e., predicting depths and surface normals from an egocentric image. Egocentric scene understanding poses unprecedented challenges: (1) due to large head movements, the images are taken from non-canonical viewpoints (i.e., tilted images) where existing models of geometry prediction do not apply; (2) dynamic foreground objects including hands constitute a large proportion of visual scenes. These challenges limit the performance of the existing models learned from large indoor datasets, such as ScanNet and NYUv2, which comprise predominantly upright images of static scenes. We present a multimodal spatial rectifier that stabilizes the egocentric images to a set of reference directions, which allows learning a coherent visual representation. Unlike a unimodal spatial rectifier that often produces excessive perspective warp…
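The rectification idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the reference directions are chosen or how the warp is computed, so everything below is an assumption made for illustration: the gravity direction is taken as known in the camera frame, the camera is an ideal pinhole with intrinsics `fx, fy, cx, cy`, the candidate reference directions are hypothetical, and the warp is the standard pure-rotation homography H = K R K^-1. All function names are illustrative and are not taken from the authors' code.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(A, v):
    return [dot(row, v) for row in A]

def rotation_between(a, b):
    """Minimal rotation R such that R @ a = b for the unit versions of a and b."""
    a, b = normalize(a), normalize(b)
    c = dot(a, b)
    if c < -1.0 + 1e-9:
        # Antiparallel case: rotate 180 degrees about an axis perpendicular to a.
        helper = [1.0, 0.0, 0.0] if abs(a[0]) < 0.9 else [0.0, 1.0, 0.0]
        n = normalize(cross(a, helper))
        return [[2.0 * n[i] * n[j] - (1.0 if i == j else 0.0) for j in range(3)]
                for i in range(3)]
    v = cross(a, b)
    S = [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]
    S2 = matmul(S, S)
    f = 1.0 / (1.0 + c)
    # Rodrigues form: R = I + [v]x + [v]x^2 / (1 + c)
    return [[(1.0 if i == j else 0.0) + S[i][j] + f * S2[i][j] for j in range(3)]
            for i in range(3)]

def nearest_reference(gravity, references):
    """Pick the reference direction with the smallest angle to gravity.

    Choosing among several modes (assumed here) keeps the rotation small;
    a single fixed reference would force a large warp for strongly tilted views.
    """
    g = normalize(gravity)
    return max(references, key=lambda r: dot(g, normalize(r)))

def rectifying_homography(gravity, references, fx, fy, cx, cy):
    """Return (H, R, ref): H maps original pixels to rectified pixels."""
    ref = nearest_reference(gravity, references)
    R = rotation_between(gravity, ref)
    K = [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]
    K_inv = [[1.0 / fx, 0.0, -cx / fx], [0.0, 1.0 / fy, -cy / fy], [0.0, 0.0, 1.0]]
    return matmul(matmul(K, R), K_inv), R, ref

def warp_pixel(H, u, v):
    """Apply a homography to a pixel coordinate (u, v)."""
    x, y, w = matvec(H, [u, v, 1.0])
    return x / w, y / w
```

In this sketch, an image whose gravity direction already coincides with one of the reference directions yields an identity homography, so upright images pass through unchanged. Tilted images are rotated only as far as the nearest reference, which is the intuition behind using several modes rather than one.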
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Hand Gesture Recognition Systems
Methods: Gravity
