A Light Touch Approach to Teaching Transformers Multi-view Geometry

Yash Bhalgat; Joao F. Henriques; Andrew Zisserman

arXiv:2211.15107 · cs.CV · April 4, 2023

Open Access

TL;DR

This paper introduces a "light touch" method that uses epipolar lines to guide the cross-attention of visual Transformers, improving pose-invariant object retrieval without requiring camera pose information at test time.

Contribution

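The method uses epipolar lines as a soft guide for the Transformer's cross-attention maps: attention outside the lines is penalized and attention along them is encouraged, so the model learns multiple-view geometry while remaining free to deviate when needed. No camera pose information is required at test time.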

Findings

01 Outperforms state-of-the-art object retrieval methods

02 Does not require camera pose information at test time

03 Improves pose-invariant retrieval accuracy

Abstract

Transformers are powerful visual learners, in large part due to their conspicuous lack of manually-specified priors. This flexibility can be problematic in tasks that involve multiple-view geometry, due to the near-infinite possible variations in 3D shapes and viewpoints (requiring flexibility), and the precise nature of projective geometry (obeying rigid laws). To resolve this conundrum, we propose a "light touch" approach, guiding visual Transformers to learn multiple-view geometry but allowing them to break free when needed. We achieve this by using epipolar lines to guide the Transformer's cross-attention maps, penalizing attention values outside the epipolar lines and encouraging higher attention along these lines since they contain geometrically plausible matches. Unlike previous methods, our proposal does not require any camera pose information at test-time. We focus on…
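The attention-guidance idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `epipolar_mask` and `epipolar_attention_loss`, the `thickness` parameter, and the choice of binary cross-entropy as the penalty are all assumptions made for illustration. The sketch assumes a known fundamental matrix `F` between two views at training time, builds a binary mask of pixels near the epipolar line of a query pixel, and scores an attention map against that mask.

```python
import numpy as np

def epipolar_mask(F, query_xy, height, width, thickness=1.0):
    """Binary mask over image 2 marking pixels within `thickness` pixels
    of the epipolar line induced by `query_xy` in image 1.
    (Hypothetical helper; `thickness` is an illustrative parameter.)"""
    x = np.array([query_xy[0], query_xy[1], 1.0])  # homogeneous query pixel
    a, b, c = F @ x                                # line a*u + b*v + c = 0
    vs, us = np.mgrid[0:height, 0:width]           # row (v) and column (u) grids
    dist = np.abs(a * us + b * vs + c) / (np.hypot(a, b) + 1e-12)
    return (dist <= thickness).astype(np.float64)

def epipolar_attention_loss(attn, mask, eps=1e-7):
    """Binary cross-entropy between an attention map with values in [0, 1]
    and the epipolar mask. BCE is an assumed choice of penalty: it grows when
    attention lies off the line and shrinks when it lies on the line."""
    attn = np.clip(attn, eps, 1.0 - eps)
    bce = mask * np.log(attn) + (1.0 - mask) * np.log(1.0 - attn)
    return float(-np.mean(bce))

# Example: pure horizontal camera translation, so epipolar lines are image rows.
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
mask = epipolar_mask(F, query_xy=(3, 2), height=5, width=8, thickness=0.5)

on_line = mask.copy()        # attention concentrated on the epipolar line
off_line = 1.0 - mask        # attention concentrated everywhere else
loss_on = epipolar_attention_loss(on_line, mask)
loss_off = epipolar_attention_loss(off_line, mask)
```

Because the penalty is only a training loss on the attention maps, nothing in this sketch is needed at inference, which is consistent with the claim that no camera pose information is required at test time.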

Taxonomy

Topics: Advanced Neural Network Applications · Robotics and Sensor-Based Localization · Human Pose and Action Recognition

Methods: Attention Is All You Need · Layer Normalization · Softmax · Adam · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Label Smoothing · Absolute Position Encodings · Linear Layer