Revisiting Vision Transformer from the View of Path Ensemble
Shuning Chang, Pichao Wang, Hao Luo, Fan Wang, Mike Zheng Shou

TL;DR
This paper reinterprets Vision Transformers as multi-path ensemble networks, introduces path pruning and ensemble re-weighting, and shows that these strategies improve performance and allow deeper architectures.
Contribution
The paper reinterprets ViTs as ensemble networks with multiple paths, proposes path pruning and ensemble re-weighting methods, and uses these strategies to build deeper, better-performing ViTs.
Findings
Path pruning improves accuracy by removing underperforming paths.
Re-weighting ensemble components enhances feature representation.
The path strategies allow ViTs to go deeper and to filter out low-frequency signals.
Abstract
Vision Transformers (ViTs) are normally regarded as a stack of transformer layers. In this work, we propose a novel view of ViTs showing that they can be seen as ensemble networks containing multiple parallel paths with different lengths. Specifically, we equivalently transform the traditional cascade of multi-head self-attention (MSA) and feed-forward network (FFN) into three parallel paths in each transformer layer. Then, we utilize the identity connection in our new transformer form and further transform the ViT into an explicit multi-path ensemble network. From the new perspective, these paths perform two functions: the first is to provide the feature for the classifier directly, and the second is to provide the lower-level feature representation for subsequent longer paths. We investigate the influence of each path for the final prediction and discover that some paths even pull…
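The per-layer transformation described in the abstract can be illustrated with a small numerical check. The sketch below is only one reading of the idea, not the authors' code: it assumes a residual layer of the form h = x + MSA(x), y = h + FFN(h), so that y = x + MSA(x) + FFN(x + MSA(x)), which is a sum of three parallel paths. The functions `msa` and `ffn` are hypothetical toy stand-ins (simple nonlinear maps, with layer normalization omitted), chosen only to show that the identity holds for arbitrary sub-blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Toy weights; these are illustrative assumptions, not values from the paper.
W_attn = rng.standard_normal((dim, dim)) / np.sqrt(dim)
W_ffn = rng.standard_normal((dim, dim)) / np.sqrt(dim)


def msa(x):
    # Hypothetical stand-in for multi-head self-attention.
    return np.tanh(x @ W_attn)


def ffn(x):
    # Hypothetical stand-in for the feed-forward network (linear map + ReLU).
    return np.maximum(x @ W_ffn, 0.0)


def cascade_layer(x):
    # Conventional view: MSA sub-block followed by FFN sub-block,
    # each wrapped in a residual (identity) connection.
    h = x + msa(x)
    return h + ffn(h)


def three_path_layer(x):
    # Parallel view: the same output written as a sum of three paths.
    identity_path = x                # path 1: identity only
    msa_path = msa(x)                # path 2: through MSA
    ffn_path = ffn(x + msa(x))       # path 3: through MSA, then FFN
    return identity_path + msa_path + ffn_path


x = rng.standard_normal((4, dim))    # 4 tokens, each of width dim
same = np.allclose(cascade_layer(x), three_path_layer(x))
print("cascade and three-path outputs match:", same)
```

Because the equivalence is purely algebraic, it does not depend on what `msa` and `ffn` compute. Stacking several such layers and expanding the identity connections is what yields paths of different lengths in the ensemble view.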
Taxonomy
Topics: Industrial Vision Systems and Defect Detection · Visual Attention and Saliency Detection · CCD and CMOS Imaging Sensors
Methods: Pruning · Focus
