Cross-Attentive Multiview Fusion of Vision-Language Embeddings

Tomas Berriel Martins; Martin R. Oswald; Javier Civera

arXiv:2604.12551·cs.CV·April 15, 2026

TL;DR

This paper introduces CAMFusion, a multiview transformer that cross-attends across vision-language descriptors from multiple viewpoints and fuses them into a single per-3D-instance embedding, improving 3D scene understanding and achieving state-of-the-art results on 3D classification benchmarks.

Contribution

The paper proposes a cross-attentive multiview fusion architecture and uses multiview consistency as a self-supervision signal alongside a standard supervised target-class loss, advancing 3D semantic segmentation.

Findings

01 Outperforms naive averaging and single-view methods

02 Achieves state-of-the-art on 3D classification benchmarks

03 Improves zero-shot out-of-domain performance

Abstract

Vision-language models have been key to the development of open-vocabulary 2D semantic segmentation. Lifting these models from 2D images to 3D scenes, however, remains a challenging problem. Existing approaches typically back-project and average 2D descriptors across views, or heuristically select a single representative one, often resulting in suboptimal 3D representations. In this work, we introduce a novel multiview transformer architecture that cross-attends across vision-language descriptors from multiple viewpoints and fuses them into a unified per-3D-instance embedding. As a second contribution, we leverage multiview consistency as a self-supervision signal for this fusion, which significantly improves performance when added to a standard supervised target-class loss. Our Cross-Attentive Multiview Fusion, which we denote with its acronym CAMFusion, not only consistently…
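
To make the contrast between averaging and attention-based fusion concrete, here is a minimal sketch in plain Python. It is not the CAMFusion architecture: it uses a single attention head with no learned projections, and the names `mean_fusion`, `attention_fusion`, `cosine_consistency_loss`, the `query` vector, and the `temperature` parameter are all hypothetical choices made for illustration. The consistency term assumes one plausible reading of "multiview consistency", namely that embeddings fused from two different view subsets of the same instance should agree. The abstract does not specify the actual loss.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n > 0 else list(v)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def mean_fusion(views):
    """Baseline: average the per-view descriptors of one 3D instance."""
    d = len(views[0])
    return [sum(v[i] for v in views) / len(views) for i in range(d)]


def attention_fusion(views, query, temperature=1.0):
    """Attention pooling: a query vector attends over the view descriptors.

    Assumption: in a trained model the query and the key/value projections
    would be learned; here the raw descriptors act as both keys and values.
    """
    d = len(query)
    scale = math.sqrt(d) * temperature
    weights = softmax([dot(query, v) / scale for v in views])
    fused = [sum(w * v[i] for w, v in zip(weights, views)) for i in range(d)]
    return fused, weights


def cosine_consistency_loss(fused_a, fused_b):
    """Assumed consistency term: 1 - cosine similarity between two
    embeddings fused from different view subsets of the same instance."""
    return 1.0 - dot(normalize(fused_a), normalize(fused_b))


# Toy example: two views agree, the third is an outlier (e.g. occluded).
views = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query = [1.0, 0.0]  # hypothetical query; would be learned in practice

averaged = mean_fusion(views)
fused, weights = attention_fusion(views, query)
loss = cosine_consistency_loss(fused, averaged)
```

In this toy setup the outlier view receives the smallest attention weight, so the fused descriptor stays closer to the two agreeing views than the plain average does. Lowering `temperature` sharpens that effect, while a very large value makes the weights nearly uniform and recovers averaging.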
