TL;DR
This paper investigates the internal mechanisms of multi-view transformers such as DUSt3R in 3D vision. It offers insight into their representations, the roles of individual layers, and how they differ from models with explicit pose biases, with the aim of improving interpretability.
Contribution
It introduces an approach for probing and visualizing 3D representations in multi-view transformers, revealing how these representations develop across layers and how the model differs from methods with stronger inductive biases of explicit global pose.
Findings
Probes and visualizes 3D representations in multi-view transformers.
Reveals how latent states develop across transformer layers.
Shows that the investigated DUSt3R variant estimates correspondences that are refined with geometry reconstruction.
Abstract
Multi-view transformers such as DUSt3R are revolutionizing 3D vision by solving 3D tasks in a feed-forward manner. However, contrary to previous optimization-based pipelines, the inner mechanisms of multi-view transformers are unclear. Their black-box nature makes further improvements beyond data scaling challenging and complicates usage in safety- and reliability-critical applications. Here, we present an approach for probing and visualizing 3D representations from the residual connections of the multi-view transformers' layers. In this manner, we investigate a variant of the DUSt3R model, shedding light on the development of its latent state across blocks and the role of the individual layers, and suggesting how it differs from methods with stronger inductive biases of explicit global pose. Finally, we show that the investigated variant of DUSt3R estimates correspondences that are refined…
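To make the idea of probing per-layer latent states concrete, here is a minimal sketch of a linear probe that regresses 3D points from hidden activations and reports the held-out error for each layer. It is an illustration under stated assumptions, not the paper's method: this summary does not specify the probe architecture, its targets, or how activations are extracted. The activations below are synthetic, and the names `fit_linear_probe`, `probe_error`, `points`, and `mixing` are hypothetical.

```python
import numpy as np


def fit_linear_probe(hidden, targets, ridge=1e-3):
    """Ridge-regress 3D targets (n, 3) from hidden states (n, d), with a bias term."""
    n, d = hidden.shape
    X = np.hstack([hidden, np.ones((n, 1))])  # append a constant column for the bias
    reg = ridge * np.eye(d + 1)
    reg[-1, -1] = 0.0  # leave the bias unpenalized
    return np.linalg.solve(X.T @ X + reg, X.T @ targets)  # shape (d + 1, 3)


def probe_error(weights, hidden, targets):
    """Mean Euclidean distance between probed and true 3D points."""
    X = np.hstack([hidden, np.ones((hidden.shape[0], 1))])
    return float(np.mean(np.linalg.norm(X @ weights - targets, axis=1)))


# Synthetic stand-in for per-token residual-stream activations (NOT DUSt3R outputs).
rng = np.random.default_rng(0)
n_tokens, d_model, n_layers = 512, 32, 4
points = rng.normal(size=(n_tokens, 3))  # hypothetical ground-truth 3D point per token
mixing = rng.normal(size=(3, d_model))   # hypothetical linear embedding of geometry
train, test = slice(0, 384), slice(384, n_tokens)

errors = []
for layer in range(1, n_layers + 1):
    # Assumption for illustration only: geometry becomes more linearly
    # decodable with depth, so the signal share grows and the noise shrinks.
    signal = layer / n_layers
    noise = rng.normal(size=(n_tokens, d_model))
    hidden = signal * (points @ mixing) + (1.0 - signal) * noise
    weights = fit_linear_probe(hidden[train], targets=points[train])
    errors.append(probe_error(weights, hidden[test], points[test]))

for layer, err in enumerate(errors, start=1):
    print(f"layer {layer}: held-out probe error = {err:.4f}")
```

With a real model, `hidden` would be replaced by activations captured after each block, for example through forward hooks, and `points` by ground-truth geometry. Plotting the held-out error against layer index is one simple way to visualize how the latent state develops across blocks.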
