MOHO: Learning Single-view Hand-held Object Reconstruction with Multi-view Occlusion-Aware Supervision

Chenyangguang Zhang; Guanlong Jiao; Yan Di; Gu Wang; Ziqin Huang; Ruida Zhang; Fabian Manhardt; Bowen Fu; Federico Tombari; Xiangyang Ji

arXiv:2310.11696 · cs.CV · March 14, 2024 · 1 citation

TL;DR

MOHO introduces a novel framework that leverages multi-view occlusion-aware supervision from hand-object videos to improve single-view hand-held object reconstruction, effectively handling occlusions without relying on 3D ground-truth models.

Contribution

The paper proposes a synthetic-to-real training framework that uses synthetic multi-view data and amodal masks to overcome occlusion challenges in real-world single-view object reconstruction.

Findings

01 MOHO outperforms 3D-supervised methods on the HO3D and DexYCB datasets.

02 Synthetic pre-training with multi-view supervision improves real-world reconstruction.

03 Domain-aware features improve the handling of object self-occlusion.

Abstract

Previous works on single-view hand-held object reconstruction typically rely on supervision from 3D ground-truth models, which are hard to collect in the real world. In contrast, readily accessible hand-object videos offer a promising source of training data, but they provide only heavily occluded observations of the object. In this paper, we present a novel synthetic-to-real framework that exploits Multi-view Occlusion-aware supervision from hand-object videos for Hand-held Object reconstruction (MOHO) from a single image, tackling the two predominant challenges in this setting: hand-induced occlusion and the object's self-occlusion. First, in the synthetic pre-training stage, we render a large-scale synthetic dataset, SOMVideo, with hand-object images and multi-view occlusion-free supervision, which is used to address hand-induced occlusion in both 2D and 3D spaces. Second, in the real-world finetuning stage,…

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Face Recognition and Analysis