Hierarchical Vision Transformer Enhanced by Graph Convolutional Network for Image Classification

Haibin Jiao

arXiv:2604.16823·cs.CV·April 21, 2026


TL;DR

This paper introduces GCN-HViT, a hierarchical vision transformer enhanced with graph convolutional networks, to better model local and global image patch relationships for improved image classification accuracy.

Contribution

The paper proposes a hierarchical vision transformer combined with a GCN to capture both local and global spatial relationships between image patches.

Findings

01. Achieves state-of-the-art performance on three real-world datasets.

02. Effectively models hierarchical relationships between patches at multiple levels.

03. Utilizes GCN to incorporate local spatial information into the transformer.

Abstract

Vision Transformer (ViT) has brought new breakthroughs to image classification by introducing the self-attention mechanism, and Graph Convolutional Networks (GCN) have been proposed and successfully applied in data representation and analysis. However, key challenges limit their further development: (1) The patch size selected by ViT is crucial for accurate predictions, which raises a natural question: how should the patch size be selected, or how can small and large patches be combined comprehensively? (2) Although spatial structure information is important in vision tasks, 1D position embeddings fail to capture the spatial structure of patches accurately. (3) GCN can capture the local connectivity relationships between image nodes, but it lacks the ability to capture global graph structural information. On the contrary, the…
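To make challenges (2) and (3) more concrete, the following is a minimal sketch of how a graph convolution can mix features between spatially adjacent patches. It is not the paper's architecture: the function names `grid_adjacency` and `gcn_layer`, the 4-neighbour grid connectivity, and the symmetric normalization with self-loops are all assumptions chosen for illustration.

```python
import numpy as np


def grid_adjacency(h, w):
    """Adjacency matrix for an h x w patch grid (assumed 4-neighbour links)."""
    n = h * w
    adj = np.zeros((n, n))
    for r in range(h):
        for c in range(w):
            i = r * w + c
            if c + 1 < w:  # link to the patch on the right
                adj[i, i + 1] = adj[i + 1, i] = 1.0
            if r + 1 < h:  # link to the patch below
                adj[i, i + w] = adj[i + w, i] = 1.0
    return adj


def gcn_layer(x, adj, weight):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])       # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ x @ weight, 0.0)


# Hypothetical sizes: a 2 x 3 grid of patches with 8-dim embeddings.
rng = np.random.default_rng(0)
adj = grid_adjacency(2, 3)
patches = rng.normal(size=(6, 8))
weight = rng.normal(size=(8, 4))
out = gcn_layer(patches, adj, weight)
```

Each output row depends only on a patch and its direct grid neighbours, which illustrates why a single graph convolution is local and why the abstract contrasts it with mechanisms that capture global structure.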
