A Saccaded Visual Transformer for General Object Spotting
Willem.T.Pye, David.A.Sinclair

TL;DR
This paper introduces a saccaded visual transformer that combines local attention with a novel training paradigm to efficiently locate objects, demonstrated on human faces.
Contribution
It presents a new model that integrates saccaded local attention with a visual transformer, together with a training method in which the network estimates the normalised distance to an object's centroid instead of class membership probabilities.
Findings
Effective object centroid estimation on faces
Fast saccaded search enabled by the model
Built-in translational invariance
Abstract
This paper presents a novel combination of a visual-transformer-style patch classifier with saccaded local attention. It also presents a novel optimisation paradigm for training object models: rather than minimising class membership probability error, the network is trained to estimate the normalised distance to the centroid of labelled objects. This approach builds a degree of translational invariance directly into the model and allows fast saccaded search with gradient ascent to find object centroids. The resulting saccaded visual transformer is demonstrated on human faces.
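The abstract gives no implementation details, so the following is only a minimal sketch of the general idea: a score that depends on normalised distance to a centroid, followed by a saccade-style hill climb. Everything here is an assumption made for illustration. The trained patch network is replaced by a closed-form stand-in scorer, the normalisation constant `scale`, the finite-difference gradient estimate, and the step-halving rule are hypothetical choices, and none of the function names come from the paper.

```python
import math


def normalised_distance(point, centroid, scale):
    """Euclidean distance to the centroid divided by `scale`, clipped to [0, 1].

    Assumption: `scale` stands in for whatever object-size normalisation is used.
    """
    d = math.hypot(point[0] - centroid[0], point[1] - centroid[1])
    return min(d / scale, 1.0)


def make_stand_in_scorer(centroid, scale):
    """Return a toy scorer: 1.0 at the centroid, falling to 0.0 at `scale` pixels.

    Hypothetical placeholder; a trained patch network would take its place.
    """
    def score(point):
        return 1.0 - normalised_distance(point, centroid, scale)
    return score


def saccade_search(score_fn, start, step=4.0, min_step=0.25, probe=1.0, max_iters=100):
    """Hill-climb on `score_fn` using finite-difference gradient estimates.

    Each accepted move is one "saccade". The step is halved whenever a move
    fails to improve the score, and the search stops once it drops below
    `min_step`.
    """
    x, y = start
    path = [(x, y)]
    for _ in range(max_iters):
        # Central differences approximate the local gradient of the score.
        gx = (score_fn((x + probe, y)) - score_fn((x - probe, y))) / (2 * probe)
        gy = (score_fn((x, y + probe)) - score_fn((x, y - probe))) / (2 * probe)
        norm = math.hypot(gx, gy)
        if norm < 1e-12:
            break  # flat region (e.g. beyond the clipped range): no direction to follow
        nx, ny = x + step * gx / norm, y + step * gy / norm
        if score_fn((nx, ny)) > score_fn((x, y)):
            x, y = nx, ny
            path.append((x, y))
        else:
            step /= 2
            if step < min_step:
                break
    return (x, y), path


if __name__ == "__main__":
    scorer = make_stand_in_scorer(centroid=(60.0, 40.0), scale=100.0)
    final, path = saccade_search(scorer, start=(10.0, 10.0))
    print("final fixation:", final, "after", len(path) - 1, "saccades")
```

Because the stand-in score is clipped, a starting point farther than `scale` from the centroid sees a flat score and the search stops immediately. A real system would need some other way to choose initial fixations, which the abstract does not describe.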
Taxonomy
Topics: Image Processing Techniques and Applications · Advanced Image and Video Retrieval Techniques · Face and Expression Recognition
