MessyKitchens: Contact-rich object-level 3D scene reconstruction
Junaid Ahmed Ansari, Ran Ding, Fabio Pizzati, Ivan Laptev

TL;DR
This paper introduces MessyKitchens, a new dataset and method for physically plausible 3D scene reconstruction at the object level, improving accuracy and contact modeling in cluttered environments.
Contribution
The paper presents a new dataset with high-fidelity object-level ground truth and extends the SAM 3D approach with a Multi-Object Decoder for improved scene reconstruction.
Findings
The MessyKitchens dataset achieves higher registration accuracy than previous datasets.
The Multi-Object Decoder (MOD) significantly improves multi-object scene reconstruction.
The approach demonstrates consistent improvements across multiple datasets.
Abstract
Monocular 3D scene reconstruction has recently seen significant progress. Powered by the modern neural architectures and large-scale data, recent methods achieve high performance in depth estimation from a single image. Meanwhile, reconstructing and decomposing common scenes into individual 3D objects remains a hard challenge due to the large variety of objects, frequent occlusions and complex object relations. Notably, beyond shape and pose estimation of individual objects, applications in robotics and animation require physically-plausible scene reconstruction where objects obey physical principles of non-penetration and realistic contacts. In this work we advance object-level scene reconstruction along two directions. First, we introduce MessyKitchens, a new dataset with real-world scenes featuring cluttered environments and providing high-fidelity object-level ground truth in terms…
Taxonomy
Topics: Robot Manipulation and Learning · Advanced Neural Network Applications · Human Pose and Action Recognition
