ImagePiece: Content-aware Re-tokenization for Efficient Image Recognition

Seungdong Yoa; Seungjun Lee; Hyeseung Cho; Bumsoo Kim; Woohyung Lim

arXiv:2412.16491·cs.CV·December 24, 2024

TL;DR

ImagePiece is a content-aware re-tokenization method for Vision Transformers. It groups semantically insufficient yet locally coherent tokens, which speeds up inference while maintaining or improving accuracy.

Contribution

The paper proposes a novel re-tokenization strategy, ImagePiece, that enhances token reduction in ViTs by grouping semantically insufficient tokens, improving speed and accuracy.
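The speed benefit of token reduction follows from how self-attention cost grows with the number of tokens. The sketch below is a rough illustration, assuming a standard Transformer attention layer and counting only its dominant matrix multiplications. The function name `attention_macs` and the example sizes (197 tokens, width 384, as in a DeiT-S-style model at 224×224 input) are illustrative assumptions, not figures taken from the paper.

```python
def attention_macs(num_tokens: int, dim: int) -> int:
    """Approximate multiply-accumulate count for one self-attention layer.

    Counts the Q, K, V and output projections (4 * N * D^2) plus the two
    token-by-token products QK^T and AV (2 * N^2 * D). Softmax, biases and
    the MLP block are ignored, so this is only a rough estimate.
    """
    projections = 4 * num_tokens * dim * dim
    token_mixing = 2 * num_tokens * num_tokens * dim
    return projections + token_mixing


# Illustrative sizes (assumed): 196 patch tokens + 1 class token, width 384.
full = attention_macs(197, 384)
reduced = attention_macs(99, 384)  # roughly half the tokens kept
print(f"full: {full:,} MACs, reduced: {reduced:,} MACs, ratio: {full / reduced:.2f}x")
```

The projection term shrinks linearly with the token count and the token-mixing term shrinks quadratically, which is why pruning or merging tokens reduces cost in every layer that follows.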

Findings

01. Increases DeiT-S inference speed by 54%.

02. At 251% acceleration, outperforms other baselines by over 8% in accuracy.

03. Compatible with existing token reduction methods.

Abstract

Vision Transformers (ViTs) have achieved remarkable success in various computer vision tasks. However, ViTs have a huge computational cost due to their inherent reliance on multi-head self-attention (MHSA), prompting efforts to accelerate ViTs for practical applications. To this end, recent works aim to reduce the number of tokens, mainly focusing on how to effectively prune or merge them. Nevertheless, since ViT tokens are generated from non-overlapping grid patches, they usually do not convey sufficient semantics, making them incompatible with efficient ViTs. To address this, we propose ImagePiece, a novel re-tokenization strategy for Vision Transformers. Following the MaxMatch strategy of NLP tokenization, ImagePiece groups semantically insufficient yet locally coherent tokens until they convey meaning. This simple re-tokenization is highly compatible with previous token reduction…
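The grouping idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the abstract is truncated and gives no procedure, so the code assumes (hypothetically) that each token comes with a scalar importance score, that "semantically insufficient" means the score falls below a threshold, and that "locally coherent" means the most cosine-similar 4-connected grid neighbour. The name `regroup_tokens` and all parameters are made up for illustration.

```python
import numpy as np


def regroup_tokens(tokens, importance, grid_hw, threshold):
    """Merge low-importance tokens into their most similar grid neighbour.

    tokens:     (H*W, D) patch embeddings in row-major grid order.
    importance: (H*W,) per-token score (assumed input; how to compute it
                is not specified here).
    Returns (merged_tokens, group_ids), where merged tokens are group means.
    """
    h, w = grid_hw
    n = h * w
    assert tokens.shape[0] == n and importance.shape[0] == n
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    parent = list(range(n))  # union-find forest over token indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for idx in range(n):
        if importance[idx] >= threshold:
            continue  # token is treated as meaningful on its own
        r, c = divmod(idx, w)
        neighbours = [(r + dr) * w + (c + dc)
                      for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                      if 0 <= r + dr < h and 0 <= c + dc < w]
        if not neighbours:
            continue
        # Pick the neighbour with the highest cosine similarity.
        best = max(neighbours, key=lambda j: float(unit[idx] @ unit[j]))
        parent[find(idx)] = find(best)

    roots = np.array([find(i) for i in range(n)])
    _, group_ids = np.unique(roots, return_inverse=True)
    num_groups = int(group_ids.max()) + 1
    merged = np.zeros((num_groups, tokens.shape[1]), dtype=float)
    np.add.at(merged, group_ids, tokens)  # sum the tokens in each group
    counts = np.bincount(group_ids, minlength=num_groups)
    return merged / counts[:, None], group_ids


# Toy 1x4 grid: the two outer tokens are low-importance and get absorbed.
toks = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
scores = np.array([0.1, 0.9, 0.9, 0.1])
merged, groups = regroup_tokens(toks, scores, (1, 4), threshold=0.5)
```

A single pass like this only approximates grouping "until they convey meaning"; an iterative variant would recompute scores for the merged tokens and repeat, which is left out here to keep the sketch short.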
