A Saccaded Visual Transformer for General Object Spotting

Willem T. Pye; David A. Sinclair

arXiv:2210.09220·cs.CV·October 18, 2022

TL;DR

This paper introduces a saccaded visual transformer that combines local attention with a novel training paradigm to efficiently locate objects, demonstrated on human faces.

Contribution

It presents a new model integrating saccaded attention with a transformer and a training method estimating object centroid distances instead of class probabilities.

Findings

1. Effective object centroid estimation on faces
2. Fast saccaded search enabled by the model
3. Built-in translational invariance

Abstract

This paper presents a novel combination of a visual-transformer-style patch classifier with saccaded local attention. A novel optimisation paradigm for training object models is also presented: rather than minimising class membership probability error, the network is trained to estimate the normalised distance to the centroid of labelled objects. This approach builds a degree of translational invariance directly into the model and allows fast saccaded search with gradient ascent to find object centroids. The resulting saccaded visual transformer is demonstrated on human faces.
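
The search idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the target is taken to be the Euclidean distance to the centroid divided by a hypothetical `radius` and clipped to 1, the trained network is replaced by a hypothetical `oracle` that returns that exact target, and gradient ascent is approximated by greedy finite-difference steps along the image axes with a shrinking step size.

```python
import math

def normalised_distance(point, centroid, radius):
    """Assumed training target: distance to the centroid / radius, clipped to 1."""
    d = math.hypot(point[0] - centroid[0], point[1] - centroid[1])
    return min(d / radius, 1.0)

def saccade_search(predict, start, step=8.0, min_step=0.5, max_saccades=100):
    """Greedy saccades: jump to the axis neighbour with the lowest predicted
    distance; halve the step whenever no neighbour improves on the fixation."""
    x, y = start
    path = [(x, y)]
    for _ in range(max_saccades):
        if step < min_step:
            break
        best = predict((x, y))
        move = None
        for dx, dy in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            candidate = (x + dx, y + dy)
            score = predict(candidate)
            if score < best:
                best, move = score, candidate
        if move is None:
            step /= 2.0          # no improvement: refine the saccade length
        else:
            x, y = move          # saccade to the better fixation point
            path.append(move)
    return (x, y), path

# Hypothetical stand-in for the trained patch network: it returns the exact
# target instead of an estimate computed from image patches.
centroid, radius = (40.0, 25.0), 60.0

def oracle(point):
    return normalised_distance(point, centroid, radius)

found, path = saccade_search(oracle, start=(10.0, 10.0))
print(found, len(path))
```

In this sketch the clipped target is flat outside `radius`, so a search started beyond that range receives no directional signal and stays where it began; a real system would need some other way to choose initial fixation points.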

Taxonomy

Topics: Image Processing Techniques and Applications · Advanced Image and Video Retrieval Techniques · Face and Expression Recognition