Hierarchical Multitask Learning Approach for BERT

Çağla Aksoy; Alper Ahmetoğlu; Tunga Güngör

arXiv:2011.04451 · cs.CL · November 10, 2020 · 1 cites

Open Access

TL;DR

This paper introduces a hierarchical multitask learning approach for BERT pre-training, transferring information between tasks at different layers and adding a new bigram shift task to improve embeddings for downstream NLP tasks.

Contribution

It proposes a novel hierarchical multitask framework for BERT pre-training, including a new bigram shift task and task-specific information transfer across layers.

Findings

1. Hierarchical multitask learning improves embedding quality.
2. Task hierarchy enhances downstream task performance.
3. Proposed methods outperform baseline BERT on probing tasks.

Abstract

Recent work shows that learning contextualized word embeddings is beneficial for downstream tasks. BERT is one successful example of this approach. It learns embeddings by solving two tasks: masked language modeling (masked LM) and next sentence prediction (NSP). The pre-training of BERT can also be framed as a multitask learning problem. In this work, we adopt hierarchical multitask learning approaches for BERT pre-training. Pre-training tasks are solved at different layers instead of only at the last layer, and information from the NSP task is transferred to the masked LM task. We also propose a new pre-training task, bigram shift, to encode word order information. We choose two downstream tasks: one requires sentence-level embeddings (textual entailment), and the other requires contextualized embeddings of words (question answering). Due to computational restrictions,…
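
The bigram shift task named in the abstract can be illustrated with a small data-preparation sketch. The excerpt here does not give the paper's exact procedure, so the following assumes the common formulation: with some probability, swap one randomly chosen pair of adjacent tokens and label the sequence as shifted. The function name `make_bigram_shift_example` and the `shift_prob` parameter are hypothetical, chosen only for this illustration.

```python
import random
from typing import List, Sequence, Tuple


def make_bigram_shift_example(
    tokens: Sequence[str],
    rng: random.Random,
    shift_prob: float = 0.5,  # assumed value, not taken from the paper
) -> Tuple[List[str], int]:
    """Return (tokens, label), where label is 1 if one adjacent pair was swapped."""
    out = list(tokens)

    # Leave the sequence untouched with probability 1 - shift_prob.
    if rng.random() >= shift_prob:
        return out, 0

    # Only positions whose neighbours differ produce a visible change.
    candidates = [i for i in range(len(out) - 1) if out[i] != out[i + 1]]
    if not candidates:
        return out, 0

    # Swap a single adjacent pair (a "bigram") chosen uniformly at random.
    i = rng.choice(candidates)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out, 1


if __name__ == "__main__":
    rng = random.Random(0)
    sentence = "the model learns word order".split()
    shifted, label = make_bigram_shift_example(sentence, rng)
    print(shifted, label)
```

A binary classifier trained on such pairs has to notice local word-order violations, which is the kind of word order information the abstract says the task is meant to encode.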

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Linear Layer · Attention Dropout · Dropout · Softmax · Multi-Head Attention · Residual Connection · Dense Connections · WordPiece · Linear Warmup With Linear Decay