Multi-View Foundation Models

Leo Segre; Or Hirschorn; Shai Avidan

arXiv:2512.15708·cs.CV·December 18, 2025

TL;DR

This paper extends single-image foundation models into multi-view models by adding 3D-aware attention layers. The resulting models extract consistent features across multiple images of the same scene, which improves tasks such as segmentation and surface normal estimation.

Contribution

The paper presents a novel approach to convert existing foundation models into multi-view models with 3D-aware attention, enhancing feature consistency across views without building explicit 3D models.

Findings

01. Improved feature matching accuracy across views.

02. Enhanced performance in surface normal estimation.

03. Better multi-view segmentation results.

Abstract

Foundation models are vital tools in various Computer Vision applications. They take as input a single RGB image and output a deep feature representation that is useful for various applications. However, when multiple views of the same 3D scene are available, they operate on each image independently and do not always produce consistent features for the same 3D point. We propose a way to convert a Foundation Model into a Multi-View Foundation Model. Such a model takes as input a set of images and outputs a feature map for each image such that the features of corresponding points are as consistent as possible. This approach bypasses the need to build a consistent 3D model of the features and allows direct manipulation in the image space. Specifically, we show how to augment Transformer-based foundation models (i.e., DINO, SAM, CLIP) with intermediate 3D-aware attention layers that help…
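The abstract's goal, that features of corresponding points be "as consistent as possible" across views, can be made concrete with a simple score. The sketch below is illustrative only and is not taken from the paper: it assumes feature maps stored as nested lists indexed `[row][col]`, assumes pixel correspondences between two views are already known, and uses mean cosine similarity as the consistency measure. The names `cosine_similarity`, `cross_view_consistency`, `view_a`, `view_b`, and `matches` are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero feature as maximally uninformative
    return dot / (norm_a * norm_b)


def cross_view_consistency(feats_a, feats_b, correspondences):
    """Mean cosine similarity over corresponding pixels of two views.

    feats_a, feats_b: feature maps as nested lists, feats[row][col] -> vector.
    correspondences: list of ((row_a, col_a), (row_b, col_b)) pairs that are
    assumed to observe the same 3D point (an assumption of this sketch).
    """
    if not correspondences:
        raise ValueError("at least one correspondence is required")
    sims = [
        cosine_similarity(feats_a[ra][ca], feats_b[rb][cb])
        for (ra, ca), (rb, cb) in correspondences
    ]
    return sum(sims) / len(sims)


# Toy 1x2 feature maps with 2-dimensional features (hypothetical data).
view_a = [[[1.0, 0.0], [0.0, 1.0]]]
view_b = [[[0.0, 2.0], [3.0, 0.0]]]

# Pixel (0, 0) in view A matches pixel (0, 1) in view B, and vice versa.
matches = [((0, 0), (0, 1)), ((0, 1), (0, 0))]
score = cross_view_consistency(view_a, view_b, matches)
```

A score near 1 means matched points carry nearly parallel features in both views; a per-image model that ignores the other views has no mechanism that pushes this score upward, which is the gap the abstract describes.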

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Medical Image Segmentation Techniques