Glance and Focus Networks for Dynamic Visual Recognition
Gao Huang, Yulin Wang, Kangchen Lv, Haojun Jiang, Wenhui Huang, Pengfei Qi, Shiji Song

TL;DR
The paper introduces GFNet, a sequential coarse-to-fine visual recognition model that adaptively attends to salient regions, reducing redundant computation and improving efficiency without sacrificing accuracy.
Contribution
GFNet formulates region localization as reinforcement learning, enabling adaptive inference and compatibility with various backbone models for efficient visual recognition.
Findings
Reduces MobileNet-V3 latency by 1.3x on iPhone XS Max
Achieves comparable accuracy with less computation
Demonstrates effectiveness on image and video recognition tasks
Abstract
Spatial redundancy widely exists in visual recognition tasks, i.e., discriminative features in an image or video frame usually correspond to only a subset of pixels, while the remaining regions are irrelevant to the task at hand. Therefore, static models which process all the pixels with an equal amount of computation result in considerable redundancy in terms of time and space consumption. In this paper, we formulate the image recognition problem as a sequential coarse-to-fine feature learning process, mimicking the human visual system. Specifically, the proposed Glance and Focus Network (GFNet) first extracts a quick global representation of the input image at a low resolution scale, and then strategically attends to a series of salient (small) regions to learn finer features. The sequential process naturally facilitates adaptive inference at test time, as it can be terminated once…
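The glance-then-focus loop described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation. The abstract is cut off before it states the stopping rule, so the confidence threshold used here is an assumption. The brightness-based `toy_classifier`, the brightest-cell region picker (standing in for the learned reinforcement learning policy), and the summing of logits across steps are likewise hypothetical placeholders chosen only to make the control flow runnable.

```python
import math
from typing import List, Tuple

Image = List[List[float]]  # grayscale image as rows of pixel intensities


def downsample(img: Image, f: int) -> Image:
    """Average-pool by factor f to build the low-resolution 'glance' input."""
    h, w = len(img) // f, len(img[0]) // f
    return [[sum(img[i * f + a][j * f + b] for a in range(f) for b in range(f)) / (f * f)
             for j in range(w)] for i in range(h)]


def crop(img: Image, top: int, left: int, size: int) -> Image:
    """Cut a size x size patch out of the full-resolution image."""
    return [row[left:left + size] for row in img[top:top + size]]


def softmax(z: List[float]) -> List[float]:
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def toy_classifier(view: Image) -> List[float]:
    """Hypothetical 2-class model: logits favor class 1 for bright views."""
    mean = sum(map(sum, view)) / (len(view) * len(view[0]))
    return [0.0, 8.0 * (mean - 0.25)]


def glance_and_focus(img: Image, patch: int = 4, factor: int = 2,
                     threshold: float = 0.9, max_focus: int = 3) -> Tuple[int, int]:
    """Return (predicted class, number of network evaluations used)."""
    # Glance: one cheap pass over the downsampled image.
    coarse = downsample(img, factor)
    logits = toy_classifier(coarse)
    steps = 1
    probs = softmax(logits)
    if max(probs) >= threshold:  # assumed stopping rule: confident enough
        return probs.index(max(probs)), steps

    # Focus: visit patches in order of coarse brightness.
    # (Stand-in for the learned region-selection policy.)
    cells = sorted(((i, j) for i in range(len(coarse)) for j in range(len(coarse[0]))),
                   key=lambda c: -coarse[c[0]][c[1]])
    for i, j in cells[:max_focus]:
        top = min(i * factor, len(img) - patch)
        left = min(j * factor, len(img[0]) - patch)
        step_logits = toy_classifier(crop(img, top, left, patch))
        # Assumed aggregation: accumulate evidence by summing logits.
        logits = [a + b for a, b in zip(logits, step_logits)]
        steps += 1
        probs = softmax(logits)
        if max(probs) >= threshold:
            break
    return probs.index(max(probs)), steps
```

In this sketch, an image that is easy to classify from the coarse view exits after a single evaluation, while an image whose evidence is concentrated in a small region needs at least one focus step. That difference in per-input cost is the adaptive-inference behavior the abstract refers to.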
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques
