ConcatPlexer: Additional Dim1 Batching for Faster ViTs
Donghoon Han, Seunghyeon Seo, Donghyeon Jeon, Jiho Jang, Chaerin Kong, and Nojun Kwak

TL;DR
ConcatPlexer introduces additional dim1 batching (i.e., concatenation) for Vision Transformers, cutting GFLOPs by 23.5% relative to ViT-B/16 while reaching 69.5% accuracy on ImageNet1K and 83.4% on CIFAR100.
Contribution
It adapts DataMUX, a cost-cutting method originally proposed for language models, to vision models as Image Multiplexer, then adds new components that address its weaknesses and balance speed against accuracy.
Findings
Achieved a 23.5% reduction in GFLOPs compared to ViT-B/16.
Maintained 69.5% accuracy on ImageNet1K.
Achieved 83.4% accuracy on CIFAR100.
Abstract
Transformers have demonstrated tremendous success not only in the natural language processing (NLP) domain but also in the field of computer vision, igniting various creative approaches and applications. Yet, the superior performance and modeling flexibility of transformers came with a severe increase in computation costs, and hence several works have proposed methods to reduce this burden. Inspired by a cost-cutting method originally proposed for language models, Data Multiplexing (DataMUX), we propose a novel approach for efficient visual recognition that employs additional dim1 batching (i.e., concatenation) and greatly improves the throughput with little compromise in the accuracy. We first introduce a naive adaptation of DataMUX for vision models, Image Multiplexer, and devise novel components to overcome its weaknesses, rendering our final model, ConcatPlexer, at the sweet spot…
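To make "dim1 batching" concrete, here is a minimal shape-level sketch in NumPy. It assumes token tensors laid out as (batch, tokens, channels) and simply groups every K consecutive samples, concatenating their tokens along dimension 1. The function name `concat_dim1` and the sizes B, N, D, K are hypothetical choices for illustration. This shows only the tensor layout; it is not the authors' implementation and does not reproduce the components that make ConcatPlexer efficient, which this summary does not detail.

```python
import numpy as np


def concat_dim1(tokens: np.ndarray, k: int) -> np.ndarray:
    """Group every k consecutive samples and concatenate them along dim 1.

    tokens: array of shape (B, N, D) = (batch, tokens per image, channels).
    Returns an array of shape (B // k, k * N, D).
    """
    b, _, _ = tokens.shape
    if b % k != 0:
        raise ValueError("batch size must be divisible by k")
    groups = [
        # tokens[i:i + k] has shape (k, N, D); joining its k slices
        # along the token axis yields one sequence of shape (k * N, D).
        np.concatenate(list(tokens[i:i + k]), axis=0)
        for i in range(0, b, k)
    ]
    return np.stack(groups, axis=0)


# Hypothetical toy sizes: 4 images, 3 tokens each, 2 channels, 2 images per group.
B, N, D, K = 4, 3, 2, 2
x = np.arange(B * N * D, dtype=np.float32).reshape(B, N, D)
y = concat_dim1(x, K)
print(x.shape, "->", y.shape)
```

The batch dimension shrinks by a factor of K while each sequence grows by the same factor, so the model sees fewer but longer sequences per forward pass.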
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
