TL;DR
This paper introduces CoCoNets, a self-supervised method for learning continuous 3D scene representations from posed RGB and RGB-D images and videos. The learned features improve downstream 3D understanding tasks such as object tracking and detection.
Contribution
It presents a novel contrastive learning framework for amodal 3D feature representations that are scalable, occlusion-aware, and effective across scene understanding tasks including visual correspondence, object tracking, and object detection.
Findings
Outperforms existing 3D feature learning methods
Enables querying of any 3D location, visible or not
Improves object tracking and detection accuracy
Abstract
This paper explores self-supervised learning of amodal 3D feature representations from RGB and RGB-D posed images and videos, agnostic to object and scene semantic content, and evaluates the resulting scene representations in the downstream tasks of visual correspondence, object tracking, and object detection. The model infers a latent 3D representation of the scene in the form of 3D feature points, where each continuous world 3D point is mapped to its corresponding feature vector. The model is trained for contrastive view prediction by rendering 3D feature clouds in queried viewpoints and matching against the 3D feature point cloud predicted from the query view. Notably, the representation can be queried for any 3D location, even if it is not visible from the input view. Our model brings together three powerful ideas of recent exciting research work: 3D feature grids as a neural…
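The contrastive matching step described in the abstract can be illustrated with a generic InfoNCE-style loss over paired point features. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `info_nce`, the temperature value, and the random toy features are all hypothetical, and the abstract does not say which contrastive loss the paper uses. Row i of each array is assumed to describe the same 3D point, once as rendered from the input view and once as predicted from the query view.

```python
import numpy as np

def info_nce(rendered, target, temperature=0.1):
    """InfoNCE-style loss between two (N, D) arrays of point features.

    Assumption: row i of `rendered` and row i of `target` describe the same
    3D point (the positive pair); every other row serves as a negative.
    """
    # L2-normalise so that dot products are cosine similarities.
    rendered = rendered / np.linalg.norm(rendered, axis=1, keepdims=True)
    target = target / np.linalg.norm(target, axis=1, keepdims=True)

    logits = rendered @ target.T / temperature            # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positives sit on the diagonal of the similarity matrix.
    return float(-np.mean(np.diag(log_prob)))

# Toy data (hypothetical): 8 points with 16-dimensional features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))

# Correctly paired features versus deliberately mismatched ones.
aligned_loss = info_nce(feats + 0.01 * rng.normal(size=feats.shape), feats)
shuffled_loss = info_nce(feats[::-1], feats)
```

In this toy setup the correctly paired features yield a lower loss than the mismatched ones, which is the signal a contrastive objective of this kind uses to pull corresponding 3D points together in feature space.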
Taxonomy
Methods
Contrastive Learning
