Single-image coherent reconstruction of objects and humans

Sarthak Batra; Partha P. Chakrabarti; Simon Hadfield; Armin Mustafa

arXiv:2408.08086·cs.CV·August 16, 2024

Open Access · 3 Reviews

TL;DR

This paper presents a novel method for reconstructing coherent 3D models of objects and humans from a single image, effectively handling occlusions and interactions without scene-level supervision.

Contribution

It introduces a collision-aware optimization framework and a robust 6-DOF pose estimation technique for heavily occluded objects, improving scene coherence in monocular reconstructions.

Findings

01 Significant reduction in mesh collisions compared to existing methods

02 Effective reconstruction of scenes with multiple interacting humans and objects

03 Operates on real-world images without scene or object-level 3D supervision

Abstract

Existing methods for reconstructing objects and humans from a monocular image suffer from severe mesh collisions and performance limitations for interacting occluding objects. This paper introduces a method to obtain a globally consistent 3D reconstruction of interacting objects and people from a single image. Our contributions include: 1) an optimization framework, featuring a collision loss, tailored to handle human-object and human-human interactions, ensuring spatially coherent scene reconstruction; and 2) a novel technique to robustly estimate 6 degrees of freedom (DOF) poses, specifically for heavily occluded objects, exploiting image inpainting. Notably, our proposed method operates effectively on images from real-world scenarios, without necessitating scene or object-level 3D supervision. Extensive qualitative and quantitative evaluation against existing methods demonstrates a…
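As a rough illustration of what a collision penalty can look like, the sketch below measures how deeply the vertices of one mesh penetrate a sphere standing in for another object. This is not the paper's loss, whose exact form is not given on this page; the sphere proxy, the squared-depth penalty, and the names `sphere_penetration_loss` and `translate` are all assumptions made for illustration.

```python
import math


def sphere_penetration_loss(vertices, center, radius):
    """Sum of squared penetration depths of `vertices` into a sphere.

    Hypothetical stand-in for a collision loss: the sphere (center, radius)
    is an assumed proxy for a second object, not the paper's formulation.
    """
    loss = 0.0
    for v in vertices:
        dist = math.dist(v, center)
        depth = max(0.0, radius - dist)  # positive only when v is inside the sphere
        loss += depth ** 2
    return loss


def translate(vertices, offset):
    """Rigidly shift every vertex by `offset` (translation part of a pose)."""
    return [tuple(c + o for c, o in zip(v, offset)) for v in vertices]


# Three vertices of a toy "object"; two of them start inside the unit sphere.
object_vertices = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (2.0, 0.0, 0.0)]
sphere_center, sphere_radius = (0.0, 0.0, 0.0), 1.0

loss_before = sphere_penetration_loss(object_vertices, sphere_center, sphere_radius)
moved = translate(object_vertices, (1.0, 0.0, 0.0))
loss_after = sphere_penetration_loss(moved, sphere_center, sphere_radius)
```

In this toy setup the penalty is positive while vertices sit inside the proxy and drops to zero once the object is translated clear of it, which is the behaviour an optimizer over object poses would exploit.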

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 3 (reject, not good enough) · Confidence 5

Strengths

+ The method addresses an important problem in computer vision using just a single image and no 3D supervision; instead, a novel collision loss is exploited.
+ The segmentation mask is improved by a well-known inpainting-based approach, which improves the precision of the 6-DOF object position in heavily occluded scenes.

Weaknesses

- The full method seems to be a good combination of well-known approaches from the literature. The authors should better explain their real contribution with respect to the state of the art. As presented, the contribution seems minor.
- Some important analyses and experiments are missing from the paper.
- Some claims are not properly validated. The quantitative analysis shows that some of the stated problems were not solved as proposed.

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

It can leverage most off-the-shelf inpainting and segmentation models and does not require extra data or 3D supervision. Well motivated. Reasoning about 3D geometry and affordances can improve most existing methods; it can also be potentially extended to generate targets for reinforcement or unsupervised learning.

Weaknesses

Not enough experimental validation. It is possible I misunderstood, but there seems to be no comparison with more reasonable baselines like silhouette fitting without inpainting, or another human-human interaction method [Jiang et al. 2020]. While the statement that the inpainting approach "greatly boosts the precision of 6 DOF object position estimations" may not technically be a misrepresentation, I believe it is confusing. The actual evaluations show mesh collision metrics and user preferenc

Reviewer 03 · Rating 3 (reject, not good enough) · Confidence 2

Strengths

This paper focuses on addressing occlusions for better human and object pose estimation. The major strength mainly comes from the human-object and human-human occlusion losses. However, many losses are inspired by the baseline [1]. Another strength is the mask inpainting for better 6-DOF pose estimation.

[1] Coherent reconstruction of multiple humans from a single image, CVPR 2022

Weaknesses

Even though this paper achieves convincing visualizations, there are quite a lot of weaknesses.
1. The technical contributions are not enough. The authors claim that they can address occlusion issues in many circumstances, e.g., human-human and human-object occlusion. However, the major methodology (loss function) to achieve their target is heavily inspired by or designed on top of existing works, e.g., the interaction loss and depth-ordering loss from [1].
2. Overclaimed contributions. For the first contr

Taxonomy

Topics: Advanced X-ray Imaging Techniques · Digital Holography and Microscopy · Advanced Optical Sensing Technologies