HyperGrid: Efficient Multi-Task Transformers with Grid-wise Decomposable Hyper Projections

Yi Tay; Zhe Zhao; Dara Bahri; Donald Metzler; Da-Cheng Juan

arXiv:2007.05891 · cs.CL · July 14, 2020 · 5 cites

TL;DR

HyperGrid introduces a decomposable hypernetwork that enables efficient multi-task learning with a single model, reporting strong results on the GLUE and SuperGLUE benchmarks by specializing regions of weight matrices for different tasks.

Contribution

It proposes a novel grid-wise hypernetwork approach that learns task-specific weight projections, improving multi-task learning efficiency and performance over traditional fine-tuning methods.

Findings

01. Strong performance on GLUE and SuperGLUE benchmarks

02. Reduces parameter costs compared to fine-tuning multiple models

03. Bridges the gap between fine-tuning and multi-task learning

Abstract

Achieving state-of-the-art performance on natural language understanding tasks typically relies on fine-tuning a fresh model for every task. Consequently, this approach leads to a higher overall parameter cost, along with higher technical maintenance for serving multiple models. Learning a single multi-task model that is able to do well for all the tasks has been a challenging and yet attractive proposition. In this paper, we propose HyperGrid, a new approach for highly effective multi-task learning. The proposed approach is based on a decomposable hypernetwork that learns grid-wise projections that help to specialize regions in weight matrices for different tasks. In order to construct the proposed hypernetwork, our method learns the interactions and composition between a global (task-agnostic) state and a local task-specific state. We apply our proposed HyperGrid on…
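The grid-wise idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give the exact formulation, so the sketch assumes that the local (task-specific) state and the global (task-agnostic) state are each factored into a row vector and a column vector, that the two are composed by an element-wise product, and that the resulting coarse grid is expanded block-wise to gate a weight matrix. All names (`grid_gate`, `U_row`, `U_col`, `global_row`, `global_col`) and all sizes are hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, used here to keep gate values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def grid_gate(task_emb, U_row, U_col, global_row, global_col, out_shape):
    """Build a block-constant gate for a weight matrix (illustrative only).

    Assumption: local and global states are rank-1 grids (outer products)
    that are composed by an element-wise product.
    """
    # Local, task-specific state: projected from the task embedding.
    local_grid = np.outer(sigmoid(U_row @ task_emb), sigmoid(U_col @ task_emb))
    # Global, task-agnostic state: shared by all tasks.
    global_grid = np.outer(sigmoid(global_row), sigmoid(global_col))
    grid = local_grid * global_grid  # shape (g_r, g_c)

    rows, cols = out_shape
    g_r, g_c = grid.shape
    if rows % g_r or cols % g_c:
        raise ValueError("weight shape must be divisible by the grid shape")
    # Expand every grid cell into a (rows // g_r) x (cols // g_c) block.
    return np.kron(grid, np.ones((rows // g_r, cols // g_c)))


# Hypothetical sizes chosen only for the demonstration.
rng = np.random.default_rng(0)
d_in, d_out, d_task, g_r, g_c = 8, 16, 4, 2, 4

W = rng.normal(size=(d_in, d_out))      # shared weight matrix
U_row = rng.normal(size=(g_r, d_task))  # projection for grid rows
U_col = rng.normal(size=(g_c, d_task))  # projection for grid columns
global_row = rng.normal(size=g_r)       # task-agnostic row state
global_col = rng.normal(size=g_c)       # task-agnostic column state

task_a = rng.normal(size=d_task)
task_b = rng.normal(size=d_task)

gate_a = grid_gate(task_a, U_row, U_col, global_row, global_col, W.shape)
gate_b = grid_gate(task_b, U_row, U_col, global_row, global_col, W.shape)

# Each task sees the same shared weights, scaled region by region.
W_task_a = gate_a * W
W_task_b = gate_b * W
```

Under these assumptions, the task-specific parameters are only the two small projections and the task embedding, while `W` is shared by every task; each task rescales `g_r * g_c` rectangular regions of `W` rather than owning a full copy of it.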

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Linear Layer · Gated Linear Unit · HyperNetwork · Attention Dropout · Inverse Square Root Schedule · Byte Pair Encoding · Dense Connections · Dropout · SentencePiece