Tiny-Attention Adapter: Contexts Are More Important Than the Number of Parameters

Hongyu Zhao; Hao Tan; Hongyuan Mei

arXiv:2211.01979·cs.CL·November 4, 2022

Open Access

TL;DR

This paper introduces tiny-attention adapters for pretrained language models: small attention modules added for parameter-efficient transfer learning. They outperform other parameter-efficient methods on GLUE while tuning only 0.05% of the parameters.

Contribution

The paper proposes tiny-attention adapters that utilize attention mechanisms with small per-head dimensions, offering a novel approach to parameter-efficient transfer learning.

Findings

01 Outperforms other parameter-efficient methods on GLUE

02 Achieves comparable results to GPT-3 and PET on FewGLUE

03 Uses only 0.05% of parameters for tuning

Abstract

Adapter-tuning is a paradigm that transfers a pretrained language model to downstream tasks by adding and tuning a small number of new parameters. Previously proposed adapter architectures are all feed-forward neural networks. In this paper, we investigate the effectiveness of using tiny-attention -- i.e., attention with extremely small per-head dimensionality -- as adapters. Our tiny-attention adapter learns to modify the hidden states at each position directly conditioned on the hidden states at all the other positions, which is missed by the previously proposed adapters. Moreover, we view its multiple attention heads as a mixture of experts and propose to average their weights during deployment, which further reduces its inference computation cost. On the GLUE benchmark, our tiny-attention adapter outperforms the other parameter-efficient transfer learning methods as well as full…
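
The mechanism described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the dimensions (model dimension 8, per-head dimension 1, 4 heads), the random initialization, the choice to sum per-head outputs into a residual update, and the elementwise averaging of each weight matrix across heads are all assumptions made for illustration. The paper's exact placement inside the transformer layer and its exact averaging scheme may differ.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(x, wq, wk, wv, wo):
    """One head: every position attends over all positions of x."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(wq[0]))                    # sqrt of per-head dimension
    k_t = [list(col) for col in zip(*k)]             # transpose of k
    scores = matmul(q, k_t)                          # shape: positions x positions
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(matmul(weights, v), wo)            # project back to model dim

def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def make_heads(d_model, d_head, n_heads, rng):
    """Hypothetical helper: one dict of weight matrices per head."""
    heads = []
    for _ in range(n_heads):
        heads.append({
            "wq": random_matrix(d_model, d_head, rng),
            "wk": random_matrix(d_model, d_head, rng),
            "wv": random_matrix(d_model, d_head, rng),
            "wo": random_matrix(d_head, d_model, rng),
        })
    return heads

def tiny_attention_adapter(x, heads):
    """Residual update (assumed form): x plus the sum of the head outputs."""
    out = [row[:] for row in x]
    for h in heads:
        delta = attention_head(x, h["wq"], h["wk"], h["wv"], h["wo"])
        out = [[o + d for o, d in zip(orow, drow)] for orow, drow in zip(out, delta)]
    return out

def average_heads(heads):
    """Assumed deployment step: average each weight matrix across heads."""
    n = len(heads)
    def avg(name):
        mats = [h[name] for h in heads]
        return [[sum(vals) / n for vals in zip(*rows)] for rows in zip(*mats)]
    return [{name: avg(name) for name in ("wq", "wk", "wv", "wo")}]

rng = random.Random(0)
x = random_matrix(5, 8, rng, scale=1.0)              # 5 positions, model dim 8
heads = make_heads(d_model=8, d_head=1, n_heads=4, rng=rng)
y_train = tiny_attention_adapter(x, heads)           # all heads evaluated
y_deploy = tiny_attention_adapter(x, average_heads(heads))  # one averaged head
```

With a per-head dimension of 1, each head in this sketch adds only 4 × 8 = 32 weights, and the averaged version evaluates a single head instead of four. The averaged head is an approximation: its output is not, in general, identical to the multi-head output.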

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Advanced Neural Network Applications

Methods: Multi-Head Attention · Attention Is All You Need · Weight Decay · Linear Layer · Layer Normalization · Cosine Annealing · Byte Pair Encoding · Residual Connection