Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training

Zhenglun Kong; Haoyu Ma; Geng Yuan; Mengshu Sun; Yanyue Xie; Peiyan Dong; Xin Meng; Xuan Shen; Hao Tang; Minghai Qin; Tianlong Chen; Xiaolong Ma; Xiaohui Xie; Zhangyang Wang; Yanzhi Wang

arXiv:2211.10801 · cs.CV · November 22, 2022

TL;DR

This paper introduces Tri-Level E-ViT, a hierarchical data redundancy reduction framework that accelerates Vision Transformer training by exploiting sparsity at multiple levels, often improving accuracy while reducing training time.

Contribution

It presents a novel end-to-end training method that reduces data redundancy across three hierarchical levels, enhancing efficiency without sacrificing accuracy.

Findings

01

Achieves up to 15.7% training speedup on ViT models.

02

Maintains or slightly improves Top-1 accuracy during acceleration.

03

Demonstrates the existence of significant data redundancy in ViT training.

Abstract

Vision transformers (ViTs) have recently obtained success in many applications, but their intensive computation and heavy memory usage at both training and inference time limit their generalization. Previous compression algorithms usually start from the pre-trained dense models and only focus on efficient inference, while time-consuming training is still unavoidable. In contrast, this paper points out that the million-scale training data is redundant, which is the fundamental reason for the tedious training. To address the issue, this paper aims to introduce sparsity into data and proposes an end-to-end efficient training framework from three sparse perspectives, dubbed Tri-Level E-ViT. Specifically, we leverage a hierarchical data redundancy reduction scheme, by exploring the sparsity under three levels: number of training examples in the dataset, number of patches (tokens) in each…
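The first two levels named in the abstract (fewer training examples, fewer patch tokens per example) can be illustrated with a minimal pure-Python sketch. Everything here is an assumption for illustration only: the function names `subsample_examples` and `keep_top_tokens`, the random example selection, and the score-based token ranking are hypothetical stand-ins, not the selection criteria used by Tri-Level E-ViT. The third level is not sketched because the abstract text is truncated before it is named.

```python
import random


def subsample_examples(dataset, keep_ratio, seed=0):
    """Example-level reduction (illustrative): keep a random fraction of examples.

    Random selection is a stand-in; the paper's actual criterion is not
    described in the text above.
    """
    rng = random.Random(seed)
    n_keep = max(1, int(len(dataset) * keep_ratio))
    return rng.sample(dataset, n_keep)


def keep_top_tokens(tokens, scores, keep_ratio):
    """Token-level reduction (illustrative): keep the highest-scoring patch tokens.

    `scores` is a hypothetical per-token importance value (for instance an
    attention weight). The original token order is preserved in the output.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    n_keep = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore positional order
    return [tokens[i] for i in kept]


if __name__ == "__main__":
    dataset = list(range(100))          # stand-in for 100 training examples
    subset = subsample_examples(dataset, keep_ratio=0.5)
    print(len(subset))                  # half of the examples are kept

    patches = ["p0", "p1", "p2", "p3"]  # stand-in for patch tokens
    importance = [0.1, 0.9, 0.3, 0.8]   # hypothetical importance scores
    print(keep_top_tokens(patches, importance, keep_ratio=0.5))
```

With a keep ratio of 0.5 at both levels, each epoch in this toy setup would process half the examples and half the tokens per example, which is the kind of compounding reduction the hierarchical scheme aims for.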

Code & Models

Repositories

zlkong/tri-level-vit (PyTorch, official)

Taxonomy

Topics: Advanced Image Fusion Techniques · CCD and CMOS Imaging Sensors · Image Enhancement Techniques