Consistent Accelerated Inference via Confident Adaptive Transformers
Tal Schuster, Adam Fisch, Tommi Jaakkola, Regina Barzilay

TL;DR
This paper introduces Confident Adaptive Transformers (CATs), a method that accelerates inference in large Transformers by stopping computation early on a per-input basis while guaranteeing, with high confidence, a specifiable degree of consistency with the original model's outputs.
Contribution
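The paper trains additional prediction heads on top of intermediate Transformer layers and calibrates an early-stopping rule, driven by a meta consistency classifier, with an extension of conformal prediction. The result is improved efficiency with a high-confidence guarantee that early predictions stay consistent with the original model.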
Findings
Effective acceleration on multiple tasks
High confidence in output consistency
Maintains performance while reducing computation
Abstract
We develop a novel approach for confidently accelerating inference in the large and expensive multilayer Transformers that are now ubiquitous in natural language processing (NLP). Amortized or approximate computational methods increase efficiency, but can come with unpredictable performance costs. In this work, we present CATs -- Confident Adaptive Transformers -- in which we simultaneously increase computational efficiency, while guaranteeing a specifiable degree of consistency with the original model with high confidence. Our method trains additional prediction heads on top of intermediate layers, and dynamically decides when to stop allocating computational effort to each input using a meta consistency classifier. To calibrate our early prediction stopping rule, we formulate a unique extension of conformal prediction. We demonstrate the effectiveness of this approach on four…
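The stopping rule described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a simplified, split-conformal-style calibration, assuming each intermediate layer yields a scalar confidence score (standing in for the meta consistency classifier) and that a held-out calibration set records whether each layer's early prediction matched the full model. The names `calibrate_threshold`, `early_exit`, `cal_scores`, `cal_consistent`, and `epsilon` are hypothetical.

```python
import math

def calibrate_threshold(cal_scores, cal_consistent, epsilon):
    """Pick an exit threshold tau from held-out calibration data.

    cal_scores[i][k]     -- assumed confidence score of layer k on example i
    cal_consistent[i][k] -- True if layer k's early prediction matched
                            the full model's prediction on example i
    epsilon              -- tolerated rate of inconsistent early exits
    """
    worst = []
    for scores, consistent in zip(cal_scores, cal_consistent):
        # Highest score assigned to any layer whose prediction was wrong.
        bad = [s for s, ok in zip(scores, consistent) if not ok]
        # If every layer is consistent, no threshold can cause a bad exit.
        worst.append(max(bad) if bad else -math.inf)
    worst.sort()
    n = len(worst)
    # Conservative (finite-sample) quantile rank, as in split conformal methods.
    rank = math.ceil((n + 1) * (1 - epsilon))
    if rank > n:
        return math.inf  # too little calibration data: never exit early
    return worst[rank - 1]

def early_exit(layer_scores, layer_preds, final_pred, tau):
    """Return (exit_layer, prediction); stop at the first layer scoring above tau."""
    for k, (score, pred) in enumerate(zip(layer_scores, layer_preds)):
        if score > tau:
            return k, pred
    # No intermediate layer was confident enough: fall back to the full model.
    return len(layer_preds), final_pred
```

In a real model the layers would be evaluated lazily, so stopping at layer `k` skips the remaining computation; the lists here only keep the sketch short. Under the assumption that test inputs are exchangeable with the calibration examples, exiting only when a score exceeds `tau` keeps the chance of an inconsistent early exit at or below `epsilon`.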
Taxonomy
Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Machine Learning and Algorithms
