An attention-driven hierarchical multi-scale representation for visual recognition
Zachary Wharton, Ardhendu Behera, Asish Bera

TL;DR
This paper introduces an attention-driven hierarchical multi-scale representation that uses Graph Convolutional Networks (GCNs) to capture long-range dependencies in images, improving accuracy on fine-grained and generic visual recognition tasks.
Contribution
It proposes a novel GCN-based method with attention-driven message propagation for hierarchical multi-scale feature modeling in visual recognition.
Findings
Outperforms state-of-the-art on three datasets
Effective for fine-grained classification
Competitive on additional datasets
Abstract
Convolutional Neural Networks (CNNs) have revolutionized the understanding of visual content. This is mainly due to their ability to break down an image into smaller pieces, extract multi-scale localized features and compose them to construct highly expressive representations for decision making. However, the convolution operation is unable to capture long-range dependencies such as arbitrary relations between pixels since it operates on a fixed-size window. Therefore, it may not be suitable for discriminating subtle changes (e.g. fine-grained visual recognition). To this end, our proposed method captures the high-level long-range dependencies by exploring Graph Convolutional Networks (GCNs), which aggregate information by establishing relationships among multi-scale hierarchical regions. These regions consist of smaller (closer look) to larger (far look), and the dependency between…
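The idea of aggregating information across regions through attention-weighted message passing can be illustrated with a small sketch. This is not the authors' implementation: the scaled dot-product attention, the fully connected region graph, the single layer, and the mean pooling are all assumptions chosen for illustration, and the names `attention_graph_layer`, `feats`, and `weight` are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_graph_layer(feats, weight):
    """One round of attention-weighted message passing over region nodes.

    feats:  N region feature vectors of length D (one node per region).
    weight: D x C projection matrix (assumed here, random in the demo).
    Returns the updated N x C node features and the N x N attention matrix.
    """
    n, d = len(feats), len(feats[0])
    c = len(weight[0])
    # Assumption: edge strength is a scaled dot product between region features.
    scores = [
        [sum(a * b for a, b in zip(feats[i], feats[j])) / math.sqrt(d)
         for j in range(n)]
        for i in range(n)
    ]
    # Normalise each row so a node's attention over all regions sums to 1.
    attn = [softmax(row) for row in scores]
    # Aggregate features from all regions, weighted by attention.
    agg = [
        [sum(attn[i][j] * feats[j][k] for j in range(n)) for k in range(d)]
        for i in range(n)
    ]
    # Project the aggregated features and apply ReLU.
    out = [
        [max(0.0, sum(agg[i][k] * weight[k][col] for k in range(d)))
         for col in range(c)]
        for i in range(n)
    ]
    return out, attn


random.seed(0)
# Six hypothetical regions (e.g. small to large crops), each with 8 features.
feats = [[random.gauss(0, 1) for _ in range(8)] for _ in range(6)]
weight = [[random.gauss(0, 0.1) for _ in range(4)] for _ in range(8)]

out, attn = attention_graph_layer(feats, weight)
# Mean-pool node features into a single image-level descriptor (an assumption).
pooled = [sum(col) / len(out) for col in zip(*out)]
```

Because every region attends to every other region, a small close-up region can draw on a large far-look region regardless of where either sits in the image, which is the kind of long-range dependency a fixed-size convolution window cannot express directly.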
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection · Domain Adaptation and Few-Shot Learning
Methods: Convolution
