Explaining the Effectiveness of Multi-Task Learning for Efficient Knowledge Extraction from Spine MRI Reports

Arijit Sehanobish; McCullen Sandora; Nabila Abraham; Jayashri Pawar; Danielle Torres; Anasuya Das; Murray Becker; Richard Herzog; Benjamin Odry; Ron Vianu

arXiv:2205.02979 · cs.LG · May 9, 2022

TL;DR

This paper investigates why multi-task learning with transformers is effective, showing that aligned representations and gradients across tasks enable a single model to perform as well as task-specific models, validated on spine MRI report datasets.

Contribution

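The paper shows that a single multi-task model can match task-specific models when those models have similar hidden representations and aligned gradients, and it hypothesizes that this alignment explains the effectiveness of multi-task learning. The observations are validated on radiologist-annotated spine MRI datasets.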

Findings

01

A single multi-task model matches task-specific models when the task-specific models show similar representations across their hidden layers.

02

Aligned gradients and representations across tasks are hypothesized to explain the effectiveness of multi-task learning.

03

The method is simple, intuitive, and applicable to various NLP problems.
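
To make "similar representations" concrete, the sketch below compares two matrices of hidden activations with linear centered kernel alignment (CKA). This is an assumption chosen for illustration: the text above does not say which similarity measure the paper uses, and the function name `linear_cka` and the random activations are hypothetical stand-ins for real hidden states of two task-specific models.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between activation matrices of shape (n_examples, n_features).

    Illustrative only: the source does not specify the similarity measure.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Center each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))

rng = np.random.default_rng(0)
# Hypothetical hidden states from one layer of model A on 50 inputs.
acts_a = rng.normal(size=(50, 8))
# Model B: the same representation, rotated and rescaled.
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
acts_b = 3.0 * acts_a @ rotation
# An unrelated representation for comparison.
acts_c = rng.normal(size=(50, 5))

same = linear_cka(acts_a, acts_b)
unrelated = linear_cka(acts_a, acts_c)
print(f"rotated copy: {same:.3f}, unrelated: {unrelated:.3f}")
```

Linear CKA is invariant to rotations and uniform rescaling of the features, so the rotated copy scores close to 1 while the unrelated activations score lower. Computing such a score layer by layer is one way to check whether two task-specific models are aligned across their hidden layers.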

Abstract

Pretrained Transformer-based models fine-tuned on domain-specific corpora have changed the landscape of NLP. However, training or fine-tuning these models for individual tasks can be time-consuming and resource-intensive. Thus, a lot of current research is focused on using transformers for multi-task learning (Raffel et al., 2020) and on how to group tasks so that a multi-task model learns effective representations that can be shared across tasks (Standley et al., 2020; Fifty et al., 2021). In this work, we show that a single multi-tasking model can match the performance of task-specific models when the task-specific models show similar representations across all of their hidden layers and their gradients are aligned, i.e., their gradients follow the same direction. We hypothesize that the above observations explain the effectiveness of multi-task learning. We validate our observations…
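
The abstract describes aligned gradients as gradients that "follow the same direction". One common way to quantify this is the cosine similarity between the gradients of two task losses with respect to shared parameters. The sketch below is a minimal illustration under that assumption; the helper `gradient_cosine` and the example vectors are hypothetical, and the abstract does not confirm that the paper measures alignment this way.

```python
import numpy as np

def gradient_cosine(grad_a, grad_b):
    """Cosine similarity between two gradients, flattened to vectors."""
    a = np.ravel(np.asarray(grad_a, dtype=float))
    b = np.ravel(np.asarray(grad_b, dtype=float))
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # Cosine is undefined for a zero gradient; report "no alignment".
        return 0.0
    return float(a @ b / denom)

# Hypothetical gradients of three task losses w.r.t. the same shared weights.
g_task1 = np.array([1.0, 2.0, 0.5])
g_task2 = np.array([2.0, 4.0, 1.0])     # same direction as g_task1
g_task3 = np.array([-1.0, -2.0, -0.5])  # opposite direction to g_task1

aligned = gradient_cosine(g_task1, g_task2)
conflicting = gradient_cosine(g_task1, g_task3)
print(f"aligned: {aligned:+.2f}, conflicting: {conflicting:+.2f}")
```

A value near +1 means the two tasks push the shared parameters in the same direction, while a value near -1 means their updates conflict. In practice the inputs would be per-task gradients collected from a real model rather than hand-written vectors.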

Taxonomy

Topics: Medical Imaging and Analysis · Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Absolute Position Encodings · Residual Connection · Layer Normalization · Position-Wise Feed-Forward Layer · Dense Connections · Dropout