CrossOver: 3D Scene Cross-Modal Alignment
Sayan Deb Sarkar, Ondrej Miksik, Marc Pollefeys, Daniel Barath, Iro Armeni

TL;DR
CrossOver introduces a flexible, scene-level cross-modal alignment framework for 3D scene understanding, enabling robust retrieval and localization across multiple modalities without requiring complete data or explicit object semantics.
Contribution
It proposes a novel, modality-agnostic embedding space for 3D scenes that relaxes data alignment constraints and supports diverse modalities and missing data scenarios.
Findings
Outperforms existing methods on ScanNet and 3RScan datasets
Supports robust scene retrieval and object localization with missing modalities
Demonstrates emergent cross-modal behaviors in learned embeddings
Abstract
Multi-modal 3D object understanding has gained significant attention, yet current approaches often assume complete data availability and rigid alignment across all modalities. We present CrossOver, a novel framework for cross-modal 3D scene understanding via flexible, scene-level modality alignment. Unlike traditional methods that require aligned modality data for every object instance, CrossOver learns a unified, modality-agnostic embedding space for scenes by aligning modalities -- RGB images, point clouds, CAD models, floorplans, and text descriptions -- with relaxed constraints and without explicit object semantics. Leveraging dimensionality-specific encoders, a multi-stage training pipeline, and emergent cross-modal behaviors, CrossOver supports robust scene retrieval and object localization, even with missing modalities. Evaluations on ScanNet and 3RScan datasets show its superior…
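As a rough illustration of what retrieval in a shared, modality-agnostic embedding space can look like, the sketch below ranks scenes by cosine similarity between a query embedding from one modality and stored scene embeddings from another. It is not the authors' implementation: the function names (`cosine`, `retrieve`), the dictionary layout, and the hand-written 3-dimensional vectors are all assumptions standing in for the outputs of learned encoders. Scenes that lack the requested modality are skipped rather than treated as errors, loosely mirroring the missing-modality setting the abstract describes.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, database, target_modality):
    """Rank scene ids by similarity to the query, using only scenes that
    have an embedding for `target_modality` (hypothetical helper)."""
    scored = [
        (cosine(query_vec, modalities[target_modality]), scene_id)
        for scene_id, modalities in database.items()
        if target_modality in modalities  # skip scenes missing this modality
    ]
    return [scene_id for _, scene_id in sorted(scored, reverse=True)]


# Toy, hand-written embeddings (assumed values, not learned features).
database = {
    "scene_a": {"point_cloud": [0.9, 0.1, 0.0], "floorplan": [0.8, 0.2, 0.1]},
    "scene_b": {"point_cloud": [0.0, 1.0, 0.1]},  # floorplan unavailable
    "scene_c": {"point_cloud": [0.1, 0.1, 0.9], "floorplan": [0.0, 0.2, 0.9]},
}

# Pretend this vector came from a text encoder mapping into the same space.
text_query = [1.0, 0.0, 0.1]

ranking = retrieve(text_query, database, "point_cloud")
floorplan_ranking = retrieve(text_query, database, "floorplan")
print(ranking)
print(floorplan_ranking)
```

Because every modality is assumed to map into the same space, the same `retrieve` call works for any query/target pair; only the set of scenes that can be ranked changes with what data is available.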
Taxonomy
Topics: 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization · Advanced Neural Network Applications
