Single-image coherent reconstruction of objects and humans
Sarthak Batra, Partha P. Chakrabarti, Simon Hadfield, Armin Mustafa

TL;DR
This paper presents a novel method for reconstructing coherent 3D models of objects and humans from a single image, effectively handling occlusions and interactions without scene-level supervision.
Contribution
It introduces a collision-aware optimization framework and a robust 6-DOF pose estimation technique for heavily occluded objects, improving scene coherence in monocular reconstructions.
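To make the idea of a collision-aware objective concrete, here is a minimal sketch of an interpenetration penalty. It is not the paper's loss: as an illustrative assumption, each mesh is approximated by a handful of spheres, and the hypothetical function `sphere_collision_penalty` sums the squared overlap depth of every intersecting pair. An optimizer over object and human poses would add such a term to its objective so that overlapping placements cost more than separated ones.

```python
import math


def sphere_collision_penalty(spheres_a, spheres_b):
    """Sum of squared penetration depths between two sets of spheres.

    Each sphere is ((x, y, z), radius). The sphere proxy is an assumption
    made for illustration; it stands in for the real mesh geometry.
    """
    penalty = 0.0
    for center_a, radius_a in spheres_a:
        for center_b, radius_b in spheres_b:
            dist = math.dist(center_a, center_b)
            # Positive depth means the two spheres interpenetrate.
            depth = radius_a + radius_b - dist
            if depth > 0.0:
                penalty += depth ** 2
    return penalty


# Toy scene: one sphere for a person, one for a chair (made-up values).
person = [((0.0, 0.0, 0.0), 0.5)]
chair_overlapping = [((0.6, 0.0, 0.0), 0.5)]
chair_apart = [((2.0, 0.0, 0.0), 0.5)]

overlap_penalty = sphere_collision_penalty(person, chair_overlapping)
apart_penalty = sphere_collision_penalty(person, chair_apart)
```

In this toy scene the overlapping placement receives a positive penalty and the separated placement receives none, which is the behavior a collision term needs in order to push meshes apart during optimization.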
Findings
Significant reduction in mesh collisions compared to existing methods
Effective reconstruction of scenes with multiple interacting humans and objects
Operates on real-world images without scene or object-level 3D supervision
Abstract
Existing methods for reconstructing objects and humans from a monocular image suffer from severe mesh collisions and performance limitations for interacting occluding objects. This paper introduces a method to obtain a globally consistent 3D reconstruction of interacting objects and people from a single image. Our contributions include: 1) an optimization framework, featuring a collision loss, tailored to handle human-object and human-human interactions, ensuring spatially coherent scene reconstruction; and 2) a novel technique to robustly estimate 6 degrees of freedom (DOF) poses, specifically for heavily occluded objects, exploiting image inpainting. Notably, our proposed method operates effectively on images from real-world scenarios, without necessitating scene or object-level 3D supervision. Extensive qualitative and quantitative evaluation against existing methods demonstrates a…
Peer Reviews
Decision: Submitted to ICLR 2024
+ The method addresses an important problem in computer vision. It uses just a single image and no 3D supervision; instead, it exploits a novel collision loss.
+ The segmentation mask is improved by a well-known inpainting-based approach. As a result, the precision of the 6-DOF object position in heavily occluded scenes is better.
- The full method seems to be a good combination of well-known approaches in the literature. On this point, I feel the authors should better explain their real contribution with respect to the state of the art. Right now, the contribution seems to be minor, according to the information in the document.
- Some important analysis and experiments are missing from the paper.
- Some claims are not properly validated. The quantitative analysis shows that some of the stated problems were not solved as proposed.
It can leverage most off-the-shelf inpainting and segmentation models and does not require extra data or 3D supervision. Well motivated. Reasoning about 3D geometry and affordances can improve most existing methods; it can also be potentially extended to generate targets for reinforcement or unsupervised learning.
Not enough experimental validation. It is possible I misunderstood, but there seems to be no comparison with more reasonable baselines like silhouette fitting without inpainting, or another human-human interaction method [Jiang et al. 2020]. While the statement that the inpainting approach "greatly boosts the precision of 6 DOF object position estimations" may not technically be a misrepresentation, I believe it is confusing. The actual evaluations show mesh collision metrics and user preferenc
This paper focuses on addressing occlusions for better human and object pose estimation. The major strength comes from the human-object and human-human occlusion losses. However, many losses are inspired by the baseline [1]. Another strength is the mask inpainting for better 6 DOF pose estimation. [1] Coherent reconstruction of multiple humans from a single image, CVPR 2020
Even though this paper achieves convincing visualizations, there are quite a lot of weaknesses.
1. The technical contributions are not enough. The authors claim that they can address occlusion issues in many circumstances, e.g., human-human and human-object occlusion. However, the major methodology (loss function) used to achieve this is heavily inspired by or based on existing works, e.g., the interaction loss and depth-ordering loss from [1].
2. Overclaimed contributions. For the first contr
Taxonomy
Topics: Advanced X-ray Imaging Techniques · Digital Holography and Microscopy · Advanced Optical Sensing Technologies
