Nested-TNT: Hierarchical Vision Transformers with Multi-Scale Feature Processing
Yuang Liu, Zhiheng Qiu, Xiaokai Qin

TL;DR
Nested-TNT is a hierarchical vision transformer that processes multi-scale features through nested algorithms, improving image classification performance over existing models such as ViT and TNT.
Contribution
The paper proposes a novel nested hierarchical transformer architecture that enhances multi-scale feature processing for vision tasks, surpassing prior models in accuracy.
Findings
Outperforms ViT and TNT on CIFAR10 and FLOWERS102 datasets.
Achieves over 2% improvement in classification accuracy.
Demonstrates better feature utilization through nested multi-scale processing.
Abstract
Transformers have been applied to computer vision because of their excellent performance in natural language processing, surpassing traditional convolutional neural networks and setting a new state of the art. ViT divides an image into several local patches, known as "visual sentences". However, the information contained in an image is vast and complex, so focusing only on features at the "visual sentence" level is not enough; the features between local patches should also be taken into consideration. To improve further, the TNT model was proposed, which further divides each local patch into smaller patches, namely "visual words", achieving more accurate results. The core of the Transformer is the Multi-Head Attention mechanism, and traditional attention mechanisms ignore interactions across different attention heads. In order to reduce redundancy and…
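The two-level patching described in the abstract can be sketched in plain Python. This is only an illustration of the idea, not the authors' implementation: the function names `split_into_patches` and `nested_patches`, the toy 8x8 single-channel image, and the patch sizes (4 for "visual sentences", 2 for "visual words") are all assumptions chosen for readability. Real models also embed each patch with a learned projection, which is omitted here.

```python
def split_into_patches(image, patch_size):
    """Split a 2D grid (list of rows) into non-overlapping square patches.

    Patches are returned in row-major order. Hypothetical helper for illustration.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches


def nested_patches(image, sentence_size, word_size):
    """For each "visual sentence" patch, return its list of "visual word" sub-patches."""
    sentences = split_into_patches(image, sentence_size)
    return [split_into_patches(sentence, word_size) for sentence in sentences]


# Toy 8x8 "image" whose pixel value encodes its position (assumed sizes, not the paper's).
image = [[r * 8 + c for c in range(8)] for r in range(8)]
nested = nested_patches(image, sentence_size=4, word_size=2)
print(len(nested), len(nested[0]))  # number of sentences, words per sentence
```

With these assumed sizes the image yields four "visual sentences", each holding four "visual words". A ViT-style model would attend only over the outer list, whereas a TNT-style model also attends over the words inside each sentence.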
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Image and Object Detection Techniques
Methods: Attention Is All You Need · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Dropout · Dense Connections · Label Smoothing · Residual Connection · Softmax · Adam
