Object-X: Learning to Reconstruct Multi-Modal 3D Object Representations

Gaia Di Lorenzo; Federico Tombari; Marc Pollefeys; Daniel Barath

arXiv:2506.04789 · cs.CV · November 6, 2025

TL;DR

Object-X introduces a versatile multi-modal 3D object representation framework that encodes and decodes rich object data, enabling high-fidelity reconstructions and efficient downstream task performance with significantly reduced storage requirements.

Contribution

This work presents Object-X, a novel framework that unifies multi-modal object encoding and decoding into explicit geometric and visual reconstructions, improving versatility and efficiency over prior task-specific methods.

Findings

01 High-fidelity novel-view synthesis comparable to standard methods

02 Significant improvement in geometric accuracy

03 Requires 3-4 orders of magnitude less storage than traditional approaches

Abstract

Learning effective multi-modal 3D representations of objects is essential for numerous applications, such as augmented reality and robotics. Existing methods often rely on task-specific embeddings that are tailored either for semantic understanding or for geometric reconstruction. As a result, these embeddings typically cannot be decoded into explicit geometry and simultaneously reused across tasks. In this paper, we propose Object-X, a versatile multi-modal object representation framework capable of encoding rich object embeddings (e.g., images, point clouds, text) and decoding them back into detailed geometric and visual reconstructions. Object-X operates by geometrically grounding the captured modalities in a 3D voxel grid and learning an unstructured embedding that fuses the information from the voxels with the object attributes. The learned embedding enables 3D Gaussian Splatting-based…
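To make the idea of grounding captured data in a 3D voxel grid more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `voxelize`, the fixed `voxel_size`, and the choice of mean pooling for per-point features are all assumptions made for illustration. The abstract does not specify how Object-X aggregates modalities or how the learned embedding is produced, so the sketch covers only the generic voxelization step.

```python
import math
from typing import Dict, List, Sequence, Tuple

Voxel = Tuple[int, int, int]


def voxelize(
    points: Sequence[Sequence[float]],
    features: Sequence[Sequence[float]],
    voxel_size: float,
) -> Dict[Voxel, List[float]]:
    """Hypothetical helper: average per-point features inside each occupied voxel.

    points     -- (x, y, z) coordinates, e.g. from an object point cloud
    features   -- one feature vector per point (e.g. colour or image features)
    voxel_size -- edge length of a cubic voxel (an illustrative parameter)
    """
    if len(points) != len(features):
        raise ValueError("points and features must have the same length")
    if voxel_size <= 0:
        raise ValueError("voxel_size must be positive")

    sums: Dict[Voxel, List[float]] = {}
    counts: Dict[Voxel, int] = {}
    for point, feature in zip(points, features):
        # Integer voxel index: floor keeps negative coordinates consistent.
        key = tuple(math.floor(c / voxel_size) for c in point)
        if key not in sums:
            sums[key] = [0.0] * len(feature)
            counts[key] = 0
        for i, value in enumerate(feature):
            sums[key][i] += value
        counts[key] += 1

    # Mean pooling is an assumption; any aggregation could be substituted.
    return {key: [s / counts[key] for s in acc] for key, acc in sums.items()}


if __name__ == "__main__":
    pts = [(0.1, 0.1, 0.1), (0.2, 0.2, 0.2), (1.5, 0.0, 0.0)]
    feats = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0]]
    print(voxelize(pts, feats, voxel_size=1.0))
```

Only occupied voxels are stored, so the result is a sparse grid. A learned encoder, as described in the abstract, would then compress such voxel features together with other object attributes into a compact embedding; that step is outside the scope of this sketch.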

Videos

Object-X: Learning to Reconstruct Multi-Modal 3D Object Representations · SlidesLive

Taxonomy

Topics: 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization · Advanced Vision and Imaging