Multi-View Foundation Models
Leo Segre, Or Hirschorn, Shai Avidan

TL;DR
This paper introduces a method for extending foundation models into multi-view models by adding 3D-aware attention layers. The resulting models extract consistent features across multiple images of the same scene, which improves downstream tasks such as segmentation and surface normal estimation.
Contribution
The paper presents an approach for converting existing foundation models into multi-view models with 3D-aware attention, improving feature consistency across views without building an explicit 3D model.
Findings
Improved feature matching accuracy across views.
Enhanced performance in surface normal estimation.
Better multi-view segmentation results.
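One common way to quantify "feature matching accuracy across views" is nearest-neighbour matching under cosine similarity, scored against known correspondences. The sketch below illustrates that idea only; this summary does not state the paper's exact metric, and the function name `match_accuracy` and its arguments are hypothetical.

```python
import numpy as np

def match_accuracy(feats_a, feats_b, gt_index):
    """Fraction of points in view A whose nearest neighbour in view B
    (by cosine similarity) is the ground-truth corresponding point.

    feats_a:  (Na, D) features sampled in view A
    feats_b:  (Nb, D) features sampled in view B
    gt_index: (Na,) index into feats_b of the true match for each row of feats_a

    Hypothetical helper for illustration, not the paper's evaluation code.
    """
    # L2-normalise so the dot product equals cosine similarity
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    sim = a @ b.T                      # (Na, Nb) similarity matrix
    pred = sim.argmax(axis=1)          # best match in B for each point in A
    return float((pred == gt_index).mean())
```

The more consistent the features of the same 3D point are across views, the more often the nearest neighbour coincides with the true correspondence, so this score rises.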
Abstract
Foundation models are vital tools in various Computer Vision applications. They take as input a single RGB image and output a deep feature representation that is useful for various applications. However, when given multiple views of the same 3D scene, they operate on each image independently and do not always produce consistent features for the same 3D point. We propose a way to convert a Foundation Model into a Multi-View Foundation Model. Such a model takes as input a set of images and outputs a feature map for each image such that the features of corresponding points are as consistent as possible. This approach bypasses the need to build a consistent 3D model of the features and allows direct manipulation in the image space. Specifically, we show how to augment Transformer-based foundation models (i.e., DINO, SAM, CLIP) with intermediate 3D-aware attention layers that help…
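To make the idea of an intermediate attention layer that spans several views more concrete, here is a minimal NumPy sketch. It assumes per-view patch tokens from a backbone and lets every token attend to tokens from all views, with a residual connection. It deliberately omits whatever geometric or camera conditioning makes the paper's layers "3D-aware", since the truncated abstract does not describe it; the function and weight names are hypothetical and this is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_attention(tokens, w_q, w_k, w_v):
    """Single-head attention shared across views (illustrative only).

    tokens: (V, N, D) patch tokens for V views, N tokens each
    w_q, w_k: (D, Dk) query/key projections (hypothetical names)
    w_v: (D, D) value projection
    Returns (V, N, D) tokens after each token has attended to all views.
    """
    v, n, d = tokens.shape
    flat = tokens.reshape(v * n, d)                   # pool tokens from every view
    q, k, val = flat @ w_q, flat @ w_k, flat @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))    # (V*N, V*N) weights
    out = flat + attn @ val                           # residual connection
    return out.reshape(v, n, d)
```

With a zero value projection the layer reduces to the identity, so the output equals the input tokens exactly. That property is one plausible reason to insert such layers with a residual path: the augmented model can start from the behaviour of the original single-view model.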
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Medical Image Segmentation Techniques
