SHARCS: Efficient Transformers through Routing with Dynamic Width Sub-networks

Mohammadreza Salehi; Sachin Mehta; Aditya Kusupati; Ali Farhadi; Hannaneh Hajishirzi

arXiv:2310.12126 · cs.LG · October 19, 2023 · 1 citation

TL;DR

SHARCS is a method for adaptive inference in transformers that dynamically routes samples to sub-networks of different widths, improving efficiency and accuracy across various tasks and architectures.

Contribution

It introduces a trainable router for transformers that enables dynamic sub-network selection based on input difficulty, enhancing efficiency and performance.

Findings

1. SHARCS outperforms existing adaptive inference methods in accuracy vs. FLOPs.

2. It generalizes across different transformer architectures and compressed models.

3. SHARCS achieves approximately 2x inference speedup with minimal accuracy loss.

Abstract

We introduce SHARCS for adaptive inference that takes into account the hardness of input samples. SHARCS can train a router on any transformer network, enabling the model to direct different samples to sub-networks with varying widths. Our experiments demonstrate that: (1) SHARCS outperforms or complements existing per-sample adaptive inference methods across various classification tasks in terms of accuracy vs. FLOPs; (2) SHARCS generalizes across different architectures and can even be applied to compressed and efficient transformer encoders to further improve their efficiency; (3) SHARCS can provide a 2x inference speedup with an insignificant drop in accuracy.
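
The idea of routing each sample to a sub-network of a chosen width can be illustrated with a small NumPy sketch. This is not the authors' implementation: the width ratios, the linear router, the choice to slice only a feed-forward block's hidden dimension, and all names (`WIDTH_RATIOS`, `route`, `sliced_ffn`, `ffn_flops`) are assumptions made for illustration, and the abstract does not specify how SHARCS trains the router or which dimensions it narrows.

```python
import numpy as np

# Hypothetical set of width ratios the router may choose from (assumption).
WIDTH_RATIOS = (0.25, 0.5, 1.0)


def route(x, router_w):
    """Pick a width ratio for one sample from the logits of a linear router."""
    logits = x @ router_w  # shape: (len(WIDTH_RATIOS),)
    return WIDTH_RATIOS[int(np.argmax(logits))]


def sliced_ffn(x, w_in, w_out, ratio):
    """Run a ReLU feed-forward block using only the first ratio * hidden units."""
    k = max(1, int(round(ratio * w_in.shape[1])))
    h = np.maximum(x @ w_in[:, :k], 0.0)  # narrower hidden activation
    return h @ w_out[:k, :]               # output keeps the model dimension


def ffn_flops(d_model, hidden, ratio):
    """Approximate FLOPs of the sliced block: two matmuls, 2 FLOPs per multiply-add."""
    k = max(1, int(round(ratio * hidden)))
    return 2 * (2 * d_model * k)


rng = np.random.default_rng(0)
d_model, hidden = 8, 32
x = rng.normal(size=d_model)
w_in = rng.normal(size=(d_model, hidden))
w_out = rng.normal(size=(hidden, d_model))
router_w = rng.normal(size=(d_model, len(WIDTH_RATIOS)))

ratio = route(x, router_w)
y = sliced_ffn(x, w_in, w_out, ratio)
print(ratio, y.shape, ffn_flops(d_model, hidden, ratio))
```

In this toy block the cost scales linearly with the chosen ratio, so a sample routed to the 0.5 ratio uses half the FLOPs of one routed to full width. How that translates into measured latency for a real transformer depends on the model and hardware.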

Taxonomy

Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Neural Networks and Applications

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings