Cross-Attentive Multiview Fusion of Vision-Language Embeddings
Tomas Berriel Martins, Martin R. Oswald, Javier Civera

TL;DR
This paper introduces CAMFusion, a multiview transformer that cross-attends to vision-language descriptors from multiple viewpoints, improving 3D scene understanding and achieving state-of-the-art results.
Contribution
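The paper proposes a cross-attentive multiview fusion architecture that fuses vision-language descriptors from multiple viewpoints into a single per-3D-instance embedding, and it uses multiview consistency as a self-supervision signal alongside a standard supervised target-class loss.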
Findings
Outperforms naive averaging and single-view methods
Achieves state-of-the-art on 3D classification benchmarks
Improves zero-shot out-of-domain performance
Abstract
Vision-language models have been key to the development of open-vocabulary 2D semantic segmentation. Lifting these models from 2D images to 3D scenes, however, remains a challenging problem. Existing approaches typically back-project and average 2D descriptors across views, or heuristically select a single representative one, often resulting in suboptimal 3D representations. In this work, we introduce a novel multiview transformer architecture that cross-attends across vision-language descriptors from multiple viewpoints and fuses them into a unified per-3D-instance embedding. As a second contribution, we leverage multiview consistency as a self-supervision signal for this fusion, which significantly improves performance when added to a standard supervised target-class loss. Our Cross-Attentive Multiview Fusion, which we denote with its acronym CAMFusion, not only consistently…
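The contrast the abstract draws between averaging descriptors and attending across them can be illustrated with a minimal sketch. This is not the authors' CAMFusion implementation: it assumes a single attention head, identity projections in place of learned query/key/value weights, a hand-picked query vector standing in for a learned instance token, and a cosine-based consistency term whose exact form is a guess, since the excerpt does not specify the loss. All function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n > 0 else list(v)

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_fuse(views):
    """Baseline: average the per-view descriptors of one 3D instance."""
    n = len(views)
    return [sum(col) / n for col in zip(*views)]

def attention_fuse(views, query):
    """Scaled dot-product attention pooling: one query attends over all views.

    Assumption: keys and values are the raw view descriptors (no learned
    projections), which a real multiview transformer would not do.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, v) / scale for v in views])
    fused = [sum(w * v[i] for w, v in zip(weights, views))
             for i in range(len(query))]
    return fused, weights

def consistency_loss(views, fused):
    """Hypothetical consistency term: mean (1 - cosine similarity)
    between the fused embedding and each per-view descriptor."""
    f = normalize(fused)
    return sum(1.0 - dot(f, normalize(v)) for v in views) / len(views)

# Toy data: three views of one instance; the third is an outlier
# (for example, an occluded or mislabeled view).
views = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
query = [4.0, 0.0, 0.0]  # stand-in for a learned instance token

fused, weights = attention_fuse(views, query)
print("attention weights:", weights)
print("attention-fused:  ", fused)
print("mean-fused:       ", mean_fuse(views))
print("consistency loss: ", consistency_loss(views, fused))
```

In this toy setting the outlier view receives less than a uniform share of the attention weight, whereas averaging gives every view the same influence. In a trained model the query and projections would be learned, and the consistency term would be added to the supervised target-class loss the abstract mentions.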
