ConcatPlexer: Additional Dim1 Batching for Faster ViTs

Donghoon Han, Seunghyeon Seo, Donghyeon Jeon, Jiho Jang, Chaerin Kong, and Nojun Kwak

arXiv:2308.11199·cs.CV·February 1, 2024

TL;DR

ConcatPlexer adapts DataMUX to Vision Transformers through additional dim1 batching (concatenation), using 23.5% fewer GFLOPs than ViT-B/16 while reaching 69.5% accuracy on ImageNet1K and 83.4% on CIFAR100, for faster inference in visual recognition tasks.

Contribution

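The paper adapts DataMUX, a cost-cutting method originally proposed for language models, to vision models as Image Multiplexer, then adds new components that address its weaknesses and balance speed against accuracy in Vision Transformers.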

Findings

01. Achieved a 23.5% reduction in GFLOPs compared to ViT-B/16.

02. Maintained 69.5% accuracy on ImageNet1K.

03. Achieved 83.4% accuracy on CIFAR100.

Abstract

Transformers have demonstrated tremendous success not only in the natural language processing (NLP) domain but also in the field of computer vision, igniting various creative approaches and applications. Yet the superior performance and modeling flexibility of transformers came with a severe increase in computation costs, and hence several works have proposed methods to reduce this burden. Inspired by a cost-cutting method originally proposed for language models, Data Multiplexing (DataMUX), we propose a novel approach for efficient visual recognition that employs additional dim1 batching (i.e., concatenation), which greatly improves throughput with little compromise in accuracy. We first introduce a naive adaptation of DataMUX for vision models, Image Multiplexer, and devise novel components to overcome its weaknesses, rendering our final model, ConcatPlexer, at the sweet spot…
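To make the phrase "additional dim1 batching (i.e., concatenation)" concrete, the following is a minimal NumPy sketch of the shape bookkeeping only. It assumes token sequences stored as an array of shape (batch, num_tokens, dim); the function names `concat_along_dim1` and `split_along_dim1` and the `group_size` parameter are hypothetical and do not come from the paper. It is not the authors' implementation, and it omits the components the abstract says were devised for the final model.

```python
import numpy as np


def concat_along_dim1(tokens: np.ndarray, group_size: int) -> np.ndarray:
    """Merge every `group_size` consecutive samples into one longer sequence.

    tokens:  (batch, num_tokens, dim)
    returns: (batch // group_size, group_size * num_tokens, dim)
    """
    batch, num_tokens, dim = tokens.shape
    if batch % group_size != 0:
        raise ValueError("batch size must be divisible by group_size")
    # A row-major reshape is equivalent to concatenating the token
    # sequences of neighbouring samples along axis 1 (the token axis).
    return tokens.reshape(batch // group_size, group_size * num_tokens, dim)


def split_along_dim1(merged: np.ndarray, group_size: int) -> np.ndarray:
    """Inverse of concat_along_dim1: recover one sequence per sample."""
    rows, total_tokens, dim = merged.shape
    if total_tokens % group_size != 0:
        raise ValueError("token count must be divisible by group_size")
    return merged.reshape(rows * group_size, total_tokens // group_size, dim)


# Hypothetical toy input: 4 samples, 3 tokens each, 2 features per token.
tokens = np.arange(4 * 3 * 2, dtype=np.float32).reshape(4, 3, 2)
merged = concat_along_dim1(tokens, group_size=2)
restored = split_along_dim1(merged, group_size=2)
print(merged.shape)  # (2, 6, 2)
```

In this toy case the leading (dim0) batch axis shrinks from 4 to 2 while the token (dim1) axis grows from 3 to 6, so each forward pass over one row carries two samples. How the model keeps the longer sequences cheap and separates the per-sample predictions is described in the paper itself, not in this sketch.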

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings