HSVLT: Hierarchical Scale-Aware Vision-Language Transformer for Multi-Label Image Classification
Shuyi Ouyang, Hongyi Wang, Ziwei Niu, Zhenjia Bai, Shiao Xie, Yingying Xu, Ruofeng Tong, Yen-Wei Chen, Lanfen Lin

TL;DR
HSVLT is a hierarchical, scale-aware vision-language transformer for multi-label image classification. It tightens visual-linguistic alignment and integrates features across multiple scales, and is reported to outperform state-of-the-art methods on three benchmark datasets at lower computational cost.
Contribution
The paper proposes a novel hierarchical multi-scale architecture and an interactive attention mechanism for better multi-label classification performance.
Findings
Outperforms state-of-the-art methods on three benchmark datasets.
Achieves higher accuracy with lower computational cost.
Effectively recognizes objects of varying sizes and appearances.
Abstract
The task of multi-label image classification involves recognizing multiple objects within a single image. Considering both valuable semantic information contained in the labels and essential visual features presented in the image, tight visual-linguistic interactions play a vital role in improving classification performance. Moreover, given the potential variance in object size and appearance within a single image, attention to features of different scales can help to discover possible objects in the image. Recently, Transformer-based methods have achieved great success in multi-label image classification by leveraging the advantage of modeling long-range dependencies, but they have several limitations. Firstly, existing methods treat visual feature extraction and cross-modal fusion as separate steps, resulting in insufficient visual-linguistic alignment in the joint semantic space.…
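The two ideas in the abstract, label embeddings interacting with visual features and attention to features at several scales, can be illustrated with a generic sketch. This is not HSVLT's architecture, which the truncated abstract does not specify; it is a minimal example of label-to-patch cross-attention averaged over two feature scales. All function names, dimensions, and the random inputs are assumptions made for illustration.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(label_vecs, patch_vecs):
    # Each label embedding (query) attends over the visual patch
    # features of one scale (used as both keys and values).
    dim = len(label_vecs[0])
    attended = []
    for q in label_vecs:
        weights = softmax([dot(q, k) / math.sqrt(dim) for k in patch_vecs])
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, patch_vecs)) for i in range(dim)]
        )
    return attended

def multi_scale_label_scores(label_vecs, scales):
    # Average the per-label logits over all scales, then apply a sigmoid
    # so each label gets an independent probability (multi-label setting).
    logits = [0.0] * len(label_vecs)
    for patch_vecs in scales:
        attended = cross_attend(label_vecs, patch_vecs)
        for j, label in enumerate(label_vecs):
            logits[j] += dot(label, attended[j]) / len(scales)
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

# Hypothetical inputs: 3 label embeddings and two feature scales
# (16 fine patches, 4 coarse patches), all with 8 dimensions.
random.seed(0)
dim = 8
labels = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
scales = [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)] for n in (16, 4)]
probs = multi_scale_label_scores(labels, scales)
```

A sigmoid per label is used rather than a softmax across labels because several objects may be present in the same image, so the label probabilities should not be forced to sum to one.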
Taxonomy
Methods
Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Softmax · Attention Is All You Need · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Multi-Head Attention · Dense Connections
