End-to-End Learning of Multi-category 3D Pose and Shape Estimation
Yigit Baran Can, Alexander Liniger, Danda Pani Paudel, Luc Van Gool

TL;DR
This paper introduces an end-to-end, Transformer-based method for multi-category 3D pose and shape estimation from images. It handles occlusions and diverse object classes, and the authors report better results than prior methods on three benchmarks.
Contribution
The paper presents a single neural network that detects 2D keypoints and lifts them to 3D across multiple object categories, using the visual context of the image and training only on 2D keypoint annotations.
Findings
Outperforms state-of-the-art on three benchmarks
Handles occlusions and multiple object categories
Uses only 2D keypoint annotations for training
Abstract
In this paper, we study the representation of the shape and pose of objects using their keypoints. To this end, we propose an end-to-end method that simultaneously detects 2D keypoints from an image and lifts them to 3D. The proposed method learns both 2D detection and 3D lifting from 2D keypoint annotations only. In addition to being end-to-end from images to 3D keypoints, our method also handles objects from multiple categories using a single neural network. We use a Transformer-based architecture to detect the keypoints and to summarize the visual context of the image. This visual context is then used when lifting the keypoints to 3D, allowing context-based reasoning for better performance. Our method can handle occlusions as well as a wide variety of object classes. Our experiments on three benchmarks demonstrate that our method performs better than the…
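One common way to learn 3D lifting from 2D keypoint annotations alone is to project the predicted 3D keypoints back into the image and penalize the 2D error on the keypoints that are visible. The sketch below illustrates that general idea only; the abstract does not state the paper's loss or camera model, so the orthographic projection, the visibility mask, and all function names here are assumptions made for illustration.

```python
import numpy as np

def rotation_about_y(angle):
    """3x3 rotation matrix about the y-axis (stand-in for a predicted camera pose)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def orthographic_project(points_3d, rotation):
    """Rotate (K, 3) keypoints into the camera frame and drop depth -> (K, 2).

    Assumption: orthographic camera; the source does not specify one.
    """
    rotated = points_3d @ rotation.T
    return rotated[:, :2]

def masked_reprojection_loss(points_3d, rotation, keypoints_2d, visible):
    """Mean squared 2D error, averaged over visible keypoints only."""
    projected = orthographic_project(points_3d, rotation)
    sq_err = np.sum((projected - keypoints_2d) ** 2, axis=1)
    weights = visible.astype(float)
    # Occluded keypoints get zero weight, so they do not affect the loss.
    return float(np.sum(sq_err * weights) / max(np.sum(weights), 1.0))

# Toy data: 8 keypoints, two of them occluded (hypothetical values).
rng = np.random.default_rng(0)
shape_3d = rng.normal(size=(8, 3))
R = rotation_about_y(0.3)
visible = np.array([1, 1, 1, 0, 1, 1, 0, 1], dtype=bool)

annotations_2d = orthographic_project(shape_3d, R)
annotations_2d[~visible] = 0.0  # occluded keypoints carry no usable annotation

loss_true = masked_reprojection_loss(shape_3d, R, annotations_2d, visible)
loss_wrong = masked_reprojection_loss(shape_3d + 0.1, R, annotations_2d, visible)
print(loss_true, loss_wrong)
```

In this toy setup the correct 3D shape gives zero loss despite the missing annotations, while a shifted shape gives a positive loss. Note that a 2D reprojection loss by itself leaves depth ambiguous; a real lifting method needs additional structure to resolve that ambiguity, which this sketch does not attempt to model.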
Taxonomy
Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Robotics and Sensor-Based Localization
