Hierarchical Vision Transformer Enhanced by Graph Convolutional Network for Image Classification
Haibin Jiao

TL;DR
This paper introduces GCN-HViT, a hierarchical vision transformer enhanced with graph convolutional networks, to better model local and global image patch relationships for improved image classification accuracy.
Contribution
The paper proposes a novel hierarchical vision transformer combined with GCN to effectively capture local and global spatial relationships in images.
Findings
Achieves state-of-the-art performance on three real-world datasets.
Effectively models hierarchical relationships between patches at multiple levels.
Utilizes GCN to incorporate local spatial information into the transformer.
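To make the last finding concrete, here is a minimal sketch of how a graph convolution can mix each patch embedding with those of its spatial neighbours. It is an illustration under stated assumptions, not the paper's implementation: the 4-neighbour grid graph, the symmetric normalization (the common Kipf and Welling form), the random weights, and the helper names `grid_adjacency` and `gcn_layer` are all hypothetical choices made for this example.

```python
import numpy as np

def grid_adjacency(rows, cols):
    """Adjacency matrix of a rows x cols patch grid with 4-neighbour edges (assumed graph)."""
    n = rows * cols
    adj = np.zeros((n, n), dtype=np.float64)
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:  # edge to the right-hand neighbour
                adj[i, i + 1] = adj[i + 1, i] = 1.0
            if r + 1 < rows:  # edge to the neighbour below
                adj[i, i + cols] = adj[i + cols, i] = 1.0
    return adj

def gcn_layer(features, adj, weight):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm @ features @ weight, 0.0)

rng = np.random.default_rng(0)
rows, cols, dim = 4, 4, 8                   # hypothetical 4x4 patch grid, 8-dim embeddings
patch_embeddings = rng.normal(size=(rows * cols, dim))
weight = rng.normal(size=(dim, dim)) * 0.1  # random stand-in for learned weights
adj = grid_adjacency(rows, cols)
out = gcn_layer(patch_embeddings, adj, weight)
print(out.shape)
```

Each output row depends only on a patch and its immediate grid neighbours, which is the local spatial information that a transformer's global self-attention does not encode by itself.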
Abstract
Vision Transformer (ViT) has brought new breakthroughs to image classification by introducing the self-attention mechanism, and Graph Convolutional Networks (GCN) have been successfully applied in data representation and analysis. However, key challenges limit their further development: (1) The patch size selected by ViT is crucial for accurate predictions, which raises a natural question: how should the patch size be selected, or how can small and large patches be combined comprehensively? (2) Although spatial structure information is important in vision tasks, 1D position embeddings fail to capture the spatial structure of patches accurately. (3) GCN can capture the local connectivity relationships between image nodes, but it lacks the ability to capture global graph structural information. On the contrary, the…
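Challenges (1) and (2) can be illustrated with a short sketch that splits one image at two patch sizes and links each small patch to the large patch that contains it, using 2D grid coordinates rather than a flat 1D index. The patch sizes (8 and 16), the 32x32 image, and the helpers `split_into_patches` and `parent_index` are assumptions chosen for this example; the paper's actual hierarchy may differ.

```python
import numpy as np

def split_into_patches(image, patch):
    """Split an (H, W, C) image into non-overlapping patch x patch tiles, one flattened tile per row."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by patch size")
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c), tiles in row-major grid order
    tiles = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return tiles.reshape(gh * gw, patch * patch * c)

def parent_index(child_idx, child_grid_w, ratio):
    """Index of the large patch containing a small patch, computed from 2D grid coordinates."""
    r, c = divmod(child_idx, child_grid_w)  # recover (row, col) of the small patch
    return (r // ratio) * (child_grid_w // ratio) + (c // ratio)

image = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
small = split_into_patches(image, 8)    # 4x4 grid of 8x8 patches
large = split_into_patches(image, 16)   # 2x2 grid of 16x16 patches
parents = [parent_index(i, 4, 2) for i in range(small.shape[0])]
print(small.shape, large.shape, parents)
```

The `parents` list gives one possible way to connect the two levels: every large patch is the parent of the four small patches inside it, so information from fine and coarse patches can be combined along those links.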
