Learning Implicit Representation for Reconstructing Articulated Objects
Hao Zhang, Fang Li, Samyak Rawlekar, and Narendra Ahuja

TL;DR
This paper presents a category-agnostic method for 3D reconstruction of articulated objects from video, estimating both explicit shapes and implicit skeletal structures without prior 3D supervision.
Contribution
It introduces a novel approach that jointly estimates explicit 3D shapes and implicit skeletal representations from motion cues, generalizing beyond category-specific models.
Findings
Outperforms state-of-the-art methods on standard datasets
Eliminates need for category-specific skeletal models
Successfully reconstructs articulated objects in the wild
Abstract
3D reconstruction of moving articulated objects without additional information about object structure is a challenging problem. Current methods overcome such challenges by employing category-specific skeletal models. Consequently, they do not generalize well to articulated objects in the wild. We treat an articulated object as an unknown, semi-rigid skeletal structure surrounded by nonrigid material (e.g., skin). Our method simultaneously estimates the visible (explicit) representation (3D shapes, colors, camera parameters) and the implicit skeletal representation, from motion cues in the object video without 3D supervision. Our implicit representation consists of four parts. (1) Skeleton, which specifies how semi-rigid parts are connected. (2) Skinning Weights, which associate each surface vertex with semi-rigid parts with probability. (3) Rigidity Coefficients,…
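The abstract describes skinning weights as a probabilistic association between each surface vertex and the semi-rigid parts. As a minimal illustration of that idea, the sketch below uses standard linear blend skinning, which is an assumption here and not confirmed by the abstract as the paper's formulation. All names (`softmax`, `rotate_z`, `blend_skinning`) and the two-part toy setup are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw per-part scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rotate_z(point, angle, pivot):
    """Rotate a 3D point about the z-axis passing through `pivot`."""
    x, y, z = (p - c for p, c in zip(point, pivot))
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y + pivot[0], s * x + c * y + pivot[1], z + pivot[2])

def blend_skinning(vertex, weights, transforms):
    """Linear blend skinning (assumed here): weighted sum of the vertex
    as moved by each part's rigid transform."""
    out = [0.0, 0.0, 0.0]
    for w, transform in zip(weights, transforms):
        moved = transform(vertex)
        for i in range(3):
            out[i] += w * moved[i]
    return tuple(out)

# Toy setup: part 0 stays fixed, part 1 rotates 90 degrees about the origin.
transforms = [
    lambda p: p,
    lambda p: rotate_z(p, math.pi / 2, (0.0, 0.0, 0.0)),
]
vertex = (1.0, 0.0, 0.0)
weights = softmax([0.0, 0.0])  # equal scores: vertex belongs to both parts equally
posed = blend_skinning(vertex, weights, transforms)
print(weights, posed)
```

With equal weights the vertex lands halfway between its two rigidly transformed positions. Pushing one score much higher than the other makes the vertex follow that part almost rigidly.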
Peer Reviews
Decision: ICLR 2024 poster
1. This paper proposes a novel method for the important task of reconstructing moving articulated objects from monocular videos. The proposed joint explicit and implicit representations seem effective in modeling both canonical structure and pose-dependent deformation.
2. Experiments have been conducted for a fair comparison with state-of-the-art methods (i.e., LASR and BANMo) that do not take ground truth skeletons. The experiments include both qualitative (geometry and appearance) and quantitative (2
1. More visualizations, such as video comparisons like those shown in LASR and BANMo, would be a more intuitive and straightforward way to show the object in motion.
2. The proposed system includes multiple hyper-parameters and multiple separate but interdependent steps. Given the complexity of the current system, it is unclear how robust this method is, e.g., with regard to initialization, hyper-parameter settings, input video contents, etc.
- The proposed method somehow worked on DAVIS to reconstruct quadrupeds and the human body.
- The idea of learning skinning and differentiable skeletons is interesting; this may inspire other deformation representations and general dynamic scene modeling, but not in this task (see weakness).
- The main concern lies in the necessity and value of the task in the current literature. There are two aspects to argue this: 1.) The reviewer guesses that given the current SoTA and technology in the community, the best way to model semi-nonrigid objects presented in this paper is to use template-based models. This paper presents animals and humans, which already have good template models. Only when the object motion structure differs a lot, and lacks a good template model, do we need some “u
- The proposed method does not require category-specific pre-training, unlike MagicPony and BANMo.
- The proposed method demonstrates significant qualitative improvements over prior works.
- The proposed method incorporates a novel mechanism for adaptively learning the optimal number of skeleton joints.
- Extensive ablation studies are conducted.
**Major**
- Missing ablation of considering optical flow visibility (Eq. 4) and Laplacian contraction.
- The effectiveness of the rigidity coefficient/dynamic rigid does not seem substantial from Fig. 9 and Table 7.
- The most recent method, MagicPony, also employs implicit and explicit representation and strongly relates to the proposed work, yet a direct comparison in the experiment is missing.
**Minor**
- Although as a post-processing step, WIM [1] also infers the skeleton with a variable nu
Taxonomy
Topics: Human Pose and Action Recognition · 3D Shape Modeling and Analysis · Advanced Vision and Imaging
