Object-X: Learning to Reconstruct Multi-Modal 3D Object Representations
Gaia Di Lorenzo, Federico Tombari, Marc Pollefeys, Daniel Barath

TL;DR
Object-X introduces a versatile multi-modal 3D object representation framework that encodes and decodes rich object data, enabling high-fidelity reconstructions and efficient downstream task performance with significantly reduced storage requirements.
Contribution
This work presents Object-X, a framework that encodes multi-modal object data into a single embedding and decodes that embedding into explicit geometric and visual reconstructions, improving versatility and efficiency over prior task-specific methods.
Findings
High-fidelity novel-view synthesis comparable to standard methods
Significant improvement in geometric accuracy
Requires 3-4 orders of magnitude less storage than traditional approaches
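To give a sense of scale for the storage finding, the short sketch below converts "orders of magnitude" into a concrete size. The 100 MB baseline is a hypothetical figure chosen only for illustration, not a number reported by the paper, and the function name `reduced_size_bytes` is likewise an assumption.

```python
def reduced_size_bytes(baseline_bytes: float, orders_of_magnitude: int) -> float:
    """Size after shrinking a baseline by the given number of orders of magnitude."""
    return baseline_bytes / (10 ** orders_of_magnitude)


# Hypothetical baseline: a 100 MB per-object representation (illustrative only).
baseline = 100_000_000

three_orders = reduced_size_bytes(baseline, 3)  # 100,000 bytes, i.e. 100 KB
four_orders = reduced_size_bytes(baseline, 4)   # 10,000 bytes, i.e. 10 KB
```

Under that assumed baseline, a reduction of 3-4 orders of magnitude would bring each object down to roughly 10-100 KB.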
Abstract
Learning effective multi-modal 3D representations of objects is essential for numerous applications, such as augmented reality and robotics. Existing methods often rely on task-specific embeddings that are tailored either for semantic understanding or geometric reconstruction. As a result, these embeddings typically cannot be decoded into explicit geometry and simultaneously reused across tasks. In this paper, we propose Object-X, a versatile multi-modal object representation framework capable of encoding rich object embeddings (e.g. images, point cloud, text) and decoding them back into detailed geometric and visual reconstructions. Object-X operates by geometrically grounding the captured modalities in a 3D voxel grid and learning an unstructured embedding fusing the information from the voxels with the object attributes. The learned embedding enables 3D Gaussian Splatting-based…
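The abstract describes grounding the captured modalities in a 3D voxel grid. As a rough illustration of that first step only, the sketch below maps a point cloud to a set of occupied voxel indices. It is a minimal, generic voxelization under stated assumptions, not the authors' implementation: the function name `voxelize`, the default resolution of 64, and the choice to normalise by the longest bounding-box side are all hypothetical.

```python
from typing import Iterable, Set, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def voxelize(points: Iterable[Point], resolution: int = 64) -> Set[Voxel]:
    """Return the set of occupied voxel indices for a point cloud.

    Points are shifted to the bounding-box minimum and scaled by the longest
    bounding-box side, so the object's aspect ratio is preserved (an assumption
    made for this sketch).
    """
    pts = list(points)
    if not pts:
        return set()

    mins = [min(p[a] for p in pts) for a in range(3)]
    maxs = [max(p[a] for p in pts) for a in range(3)]
    # Guard against a degenerate cloud where all points coincide.
    extent = max(maxs[a] - mins[a] for a in range(3)) or 1.0

    occupied: Set[Voxel] = set()
    for p in pts:
        voxel = tuple(
            # Clamp so points on the upper boundary fall in the last voxel.
            min(int((p[a] - mins[a]) / extent * resolution), resolution - 1)
            for a in range(3)
        )
        occupied.add(voxel)
    return occupied


# Usage: three points on the diagonal of a unit cube, on a 4x4x4 grid.
cells = voxelize([(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (0.5, 0.5, 0.5)], resolution=4)
```

A sparse set of indices is used here instead of a dense array because most voxels around an object are empty. How Object-X attaches image or text features to the voxels and fuses them into an embedding is not covered by this sketch.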
Taxonomy
Topics: 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization · Advanced Vision and Imaging
