RayTran: 3D pose estimation and shape reconstruction of multiple objects from videos with ray-traced transformers
Michał J. Tyszkiewicz, Kevis-Kokitsi Maninis, Stefan Popov, Vittorio Ferrari

TL;DR
This paper introduces RayTran, a transformer-based neural network for multi-object 3D reconstruction from RGB videos, which efficiently combines 3D and 2D features to estimate object shapes and poses without tracking.
Contribution
It presents a novel end-to-end trainable architecture that exchanges information between 3D and 2D representations using bidirectional attention, improving 3D pose and shape estimation from videos.
Findings
Outperforms recent state-of-the-art methods on the Scan2CAD dataset.
Efficiently reasons about scenes from multiple frames without tracking.
Achieves significant improvements in 3D object pose estimation accuracy.
Abstract
We propose a transformer-based neural network architecture for multi-object 3D reconstruction from RGB videos. It relies on two alternative ways to represent its knowledge: as a global 3D grid of features and an array of view-specific 2D grids. We progressively exchange information between the two with a dedicated bidirectional attention mechanism. We exploit knowledge about the image formation process to significantly sparsify the attention weight matrix, making our architecture feasible on current hardware, both in terms of memory and computation. We attach a DETR-style head on top of the 3D feature grid in order to detect the objects in the scene and to predict their 3D pose and 3D shape. Compared to previous methods, our architecture is single stage, end-to-end trainable, and it can reason holistically about a scene from multiple video frames without needing a brittle tracking step.…
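The sparsification idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes a pinhole camera model, lets each voxel center attend only to the single pixel it projects to in each view (the paper reasons along rays, which this simplifies), and uses hypothetical helper names (`project_voxels`, `sparsity_mask`, `masked_attention`) chosen for illustration only.

```python
import numpy as np

def project_voxels(centers, K, R, t, hw):
    """Project voxel centers (N, 3) into one view.

    Returns the flat pixel index hit by each voxel, or -1 if the voxel is
    behind the camera or outside the image. Assumes a pinhole camera with
    intrinsics K (3, 3) and world-to-camera pose R (3, 3), t (3,).
    """
    h, w = hw
    cam = centers @ R.T + t                   # world -> camera coordinates
    z = cam[:, 2]
    safe_z = np.where(z > 1e-6, z, 1.0)       # avoid dividing by zero or negative depth
    uvw = cam @ K.T                           # homogeneous pixel coordinates
    u = np.floor(uvw[:, 0] / safe_z).astype(int)
    v = np.floor(uvw[:, 1] / safe_z).astype(int)
    visible = (z > 1e-6) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    return np.where(visible, v * w + u, -1)

def sparsity_mask(centers, cameras, hw):
    """Boolean mask (N voxels, V * H * W pixels): True where attention is allowed."""
    h, w = hw
    mask = np.zeros((centers.shape[0], len(cameras) * h * w), dtype=bool)
    for i, (K, R, t) in enumerate(cameras):
        idx = project_voxels(centers, K, R, t, hw)
        hit = idx >= 0
        mask[np.nonzero(hit)[0], i * h * w + idx[hit]] = True
    return mask

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to the entries allowed by mask.

    Queries with no allowed key receive an all-zero output.
    """
    scores = q @ k.T / np.sqrt(q.shape[1])
    scores = np.where(mask, scores, -np.inf)
    has_any = mask.any(axis=1, keepdims=True)
    scores = np.where(has_any, scores, 0.0)   # keep fully masked rows free of NaN
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores) * mask
    denom = weights.sum(axis=1, keepdims=True)
    weights = np.divide(weights, denom, out=np.zeros_like(weights), where=denom > 0)
    return weights @ v
```

Here the voxel features act as queries and the pixel features as keys and values, which corresponds to the 2D-to-3D direction. Under the same assumptions, the 3D-to-2D direction would reuse the transposed mask with the roles swapped. The dense boolean mask is only for readability; a memory-efficient version would store just the allowed index pairs.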
Taxonomy
Topics: Human Pose and Action Recognition · 3D Shape Modeling and Analysis · Advanced Vision and Imaging
