Hierarchical Multitask Learning Approach for BERT
Çağla Aksoy, Alper Ahmetoğlu, Tunga Güngör

TL;DR
This paper introduces a hierarchical multitask learning approach for BERT pre-training, transferring information between tasks at different layers and adding a new bigram shift task to improve embeddings for downstream NLP tasks.
Contribution
It proposes a novel hierarchical multitask framework for BERT pre-training, including a new bigram shift task and task-specific information transfer across layers.
Findings
Hierarchical multitask learning improves embedding quality.
Task hierarchy enhances downstream task performance.
Proposed methods outperform baseline BERT on probing tasks.
Abstract
Recent works show that learning contextualized embeddings for words is beneficial for downstream tasks. BERT is one successful example of this approach. It learns embeddings by solving two tasks, which are masked language model (masked LM) and the next sentence prediction (NSP). The pre-training of BERT can also be framed as a multitask learning problem. In this work, we adopt hierarchical multitask learning approaches for BERT pre-training. Pre-training tasks are solved at different layers instead of the last layer, and information from the NSP task is transferred to the masked LM task. Also, we propose a new pre-training task bigram shift to encode word order information. We choose two downstream tasks, one of which requires sentence-level embeddings (textual entailment), and the other requires contextualized embeddings of words (question answering). Due to computational restrictions,…
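The bigram shift task mentioned in the abstract can be illustrated with a small data-construction sketch. This is not the authors' implementation: it assumes a sentence-level binary label (1 if one adjacent token pair was swapped, 0 otherwise), in the style of the bigram shift probing task. The function name `make_bigram_shift_example` and the `p_shift` value are hypothetical choices made for illustration.

```python
import random
from typing import List, Tuple


def make_bigram_shift_example(
    tokens: List[str],
    rng: random.Random,
    p_shift: float = 0.5,  # hypothetical value, not taken from the paper
) -> Tuple[List[str], int]:
    """Return (tokens, label); label is 1 if one adjacent pair was swapped."""
    # Only positions whose swap visibly changes the sequence are candidates,
    # so a positive label always corresponds to an altered word order.
    candidates = [i for i in range(len(tokens) - 1) if tokens[i] != tokens[i + 1]]
    if not candidates or rng.random() >= p_shift:
        return list(tokens), 0  # unchanged copy, negative example
    i = rng.choice(candidates)
    shifted = list(tokens)
    shifted[i], shifted[i + 1] = shifted[i + 1], shifted[i]
    return shifted, 1


rng = random.Random(0)
sentence = "the cat sat on the mat".split()
examples = [make_bigram_shift_example(sentence, rng) for _ in range(4)]
for toks, label in examples:
    print(label, " ".join(toks))
```

In a pre-training setup, pairs like these would feed a binary classification head alongside the masked LM and NSP objectives. Which layer that head attaches to is a design decision of the hierarchical scheme and is not modeled here.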
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Linear Layer · Attention Dropout · Dropout · Softmax · Multi-Head Attention · Residual Connection · Dense Connections · WordPiece · Linear Warmup With Linear Decay
