HSVLT: Hierarchical Scale-Aware Vision-Language Transformer for Multi-Label Image Classification

Shuyi Ouyang; Hongyi Wang; Ziwei Niu; Zhenjia Bai; Shiao Xie; Yingying Xu; Ruofeng Tong; Yen-Wei Chen; Lanfen Lin

arXiv:2407.16244·cs.CV·July 24, 2024

TL;DR

HSVLT is a hierarchical, scale-aware vision-language transformer for multi-label image classification. It improves visual-linguistic alignment and multi-scale feature integration, and is reported to outperform state-of-the-art methods on three benchmark datasets at lower computational cost.

Contribution

The paper proposes a novel hierarchical multi-scale architecture and an interactive attention mechanism for better multi-label classification performance.

Findings

01. Outperforms state-of-the-art methods on three benchmark datasets.

02. Achieves higher accuracy with lower computational cost.

03. Effectively recognizes objects of varying sizes and appearances.

Abstract

The task of multi-label image classification involves recognizing multiple objects within a single image. Considering both valuable semantic information contained in the labels and essential visual features presented in the image, tight visual-linguistic interactions play a vital role in improving classification performance. Moreover, given the potential variance in object size and appearance within a single image, attention to features of different scales can help to discover possible objects in the image. Recently, Transformer-based methods have achieved great success in multi-label image classification by leveraging the advantage of modeling long-range dependencies, but they have several limitations. Firstly, existing methods treat visual feature extraction and cross-modal fusion as separate steps, resulting in insufficient visual-linguistic alignment in the joint semantic space.…
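To make the task definition concrete, the following is a minimal sketch of how a multi-label decision differs from single-label classification: each label gets its own independent sigmoid score, and every label above a threshold is kept, so one image can yield several labels. This is a generic illustration, not the HSVLT architecture; the function name `predict_labels`, the label names, the logits, and the 0.5 threshold are all hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, label_names, threshold=0.5):
    """Return every label whose independent sigmoid score reaches the threshold.

    Unlike softmax, the scores are not forced to sum to 1, so any number
    of labels (including none) can be predicted for one image.
    """
    scores = [sigmoid(z) for z in logits]
    return [name for name, s in zip(label_names, scores) if s >= threshold]


# Hypothetical per-label logits produced by some classifier for one image.
labels = ["person", "dog", "bicycle", "car"]
logits = [2.1, 0.3, -1.7, -0.2]

predicted = predict_labels(logits, labels)
print(predicted)  # ['person', 'dog']
```

Scoring each label independently is what allows objects of different sizes and categories in the same image to be reported together; how those per-label scores are computed is where a model such as HSVLT does its work.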

Taxonomy

Methods: Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Softmax · Attention Is All You Need · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Multi-Head Attention · Dense Connections