Tiny-Attention Adapter: Contexts Are More Important Than the Number of Parameters
Hongyu Zhao, Hao Tan, Hongyuan Mei

TL;DR
This paper introduces tiny-attention adapters for pretrained language models: attention modules with extremely small per-head dimensionality that are added to the model and tuned in place of the full network. On GLUE, they outperform other parameter-efficient transfer learning methods while updating only 0.05% of the parameters.
Contribution
The paper proposes tiny-attention adapters that utilize attention mechanisms with small per-head dimensions, offering a novel approach to parameter-efficient transfer learning.
Findings
Outperforms other parameter-efficient methods on GLUE
Achieves comparable results to GPT-3 and PET on FewGLUE
Updates only 0.05% of the parameters during tuning
Abstract
Adapter-tuning is a paradigm that transfers a pretrained language model to downstream tasks by adding and tuning a small number of new parameters. Previously proposed adapter architectures are all feed-forward neural networks. In this paper, we investigate the effectiveness of using tiny-attention -- i.e., attention with extremely small per-head dimensionality -- as adapters. Our tiny-attention adapter learns to modify the hidden states at each position directly conditioned on the hidden states at all the other positions, which is missed by the previously proposed adapters. Moreover, we view its multiple attention heads as a mixture of experts and propose to average their weights during deployment, which further reduces its inference computation cost. On the GLUE benchmark, our tiny-attention adapter outperforms the other parameter-efficient transfer learning methods as well as full…
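The mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the class name `TinyAttentionAdapter`, the initialization scale, the residual connection, the absence of biases, and the sizes used in the demo are all assumptions chosen for illustration. The `averaged` method shows one possible reading of "average their weights during deployment" (taking the mean of each head's projection matrices to obtain a single head); the paper may define the averaging differently.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class TinyAttentionAdapter:
    """Hypothetical adapter: multi-head self-attention with a tiny per-head dimension."""

    def __init__(self, d_model, n_heads=4, d_head=1, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.02  # assumed init scale, not taken from the paper
        self.d_head = d_head
        # One small projection per head: d_model -> d_head for Q, K, V.
        self.Wq = rng.normal(0.0, scale, (n_heads, d_model, d_head))
        self.Wk = rng.normal(0.0, scale, (n_heads, d_model, d_head))
        self.Wv = rng.normal(0.0, scale, (n_heads, d_model, d_head))
        # Output projection per head: d_head -> d_model.
        self.Wo = rng.normal(0.0, scale, (n_heads, d_head, d_model))

    def __call__(self, h):
        # h: (seq_len, d_model) hidden states from the frozen model.
        q = np.einsum("td,hdk->htk", h, self.Wq)
        k = np.einsum("td,hdk->htk", h, self.Wk)
        v = np.einsum("td,hdk->htk", h, self.Wv)
        scores = np.einsum("htk,hsk->hts", q, k) / np.sqrt(self.d_head)
        attn = softmax(scores, axis=-1)          # each position attends to all positions
        ctx = np.einsum("hts,hsk->htk", attn, v)
        delta = np.einsum("htk,hkd->td", ctx, self.Wo)  # sum over heads
        return h + delta                         # assumed residual update

    def num_params(self):
        return sum(w.size for w in (self.Wq, self.Wk, self.Wv, self.Wo))

    def averaged(self):
        # Assumed deployment step: collapse all heads into one by averaging weights.
        merged = TinyAttentionAdapter.__new__(TinyAttentionAdapter)
        merged.d_head = self.d_head
        for name in ("Wq", "Wk", "Wv", "Wo"):
            setattr(merged, name, getattr(self, name).mean(axis=0, keepdims=True))
        return merged

# Demo with made-up sizes: 5 positions, hidden size 16.
rng = np.random.default_rng(1)
h = rng.normal(size=(5, 16))
adapter = TinyAttentionAdapter(d_model=16, n_heads=4, d_head=1)
out = adapter(h)

# Perturb position 0 only; the output at position 1 changes as well,
# because the update at each position depends on all positions.
h_perturbed = h.copy()
h_perturbed[0] += 1.0
out_perturbed = adapter(h_perturbed)

merged = adapter.averaged()
print(adapter.num_params(), merged.num_params())
```

With a per-head dimension of 1, each head adds only `4 * d_model` weights in this sketch, which is why the adapter stays small relative to the frozen model. A feed-forward adapter applied position by position would leave the output at position 1 unchanged under the perturbation above; the attention-based adapter does not.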
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Weight Decay · Linear Layer · Layer Normalization · Cosine Annealing · Byte Pair Encoding · Residual Connection
