SOI Matters: Analyzing Multi-Setting Training Dynamics in Pretrained Language Models via Subsets of Interest

Shayan Vassef; Amirhossein Dabiriaghdam; Mohammadreza Bakhtiari; Yadollah Yaghoobzadeh

arXiv:2507.15236 · cs.CL · July 22, 2025

TL;DR

This paper introduces the Subsets of Interest (SOI) framework to analyze training dynamics in pretrained language models across multi-task, multi-lingual, and multi-source settings, showing how each training strategy affects model robustness and performance.

Contribution

It proposes the SOI categorization and SOI transition heatmaps, which give new insight into training behavior, along with an SOI-based subset selection approach that improves model performance.

Findings

01

Multi-source learning enhances out-of-distribution robustness by up to 7%.

02

Multi-task learning yields mixed results, with gains in similar task pairs.

03

Two-stage fine-tuning with SOI-based subset selection improves performance.

Abstract

This work investigates the impact of multi-task, multi-lingual, and multi-source learning approaches on the robustness and performance of pretrained language models. To enhance this analysis, we introduce Subsets of Interest (SOI), a novel categorization framework that identifies six distinct learning behavior patterns during training, including forgettable examples, unlearned examples, and always correct examples. Through SOI transition heatmaps and dataset cartography visualization, we analyze how examples shift between these categories when transitioning from single-setting to multi-setting configurations. We perform comprehensive experiments across three parallel comparisons: multi-task vs. single-task learning using English tasks (entailment, paraphrase, sentiment), multi-source vs. single-source learning using sentiment analysis datasets, and multi-lingual vs. single-lingual…
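To make the idea of categorizing examples by their learning behavior concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes each example comes with a per-epoch record of whether the model predicted it correctly, and it models only the three categories named in the abstract. The definitions are assumptions: "unlearned" is read as never correct, and "forgettable" as having at least one correct-to-incorrect flip between consecutive epochs. The remaining categories are not named in the text above, so they fall into a placeholder `other` bucket. The function names `categorize` and `transition_counts` are hypothetical.

```python
from collections import Counter


def categorize(history):
    """Assign a coarse label to one example's per-epoch correctness history.

    history: sequence of bools, one per epoch (True = predicted correctly).
    Only the three categories named in the abstract are modelled; their
    definitions here are assumptions, and any other pattern is "other".
    """
    if not history:
        raise ValueError("history must contain at least one epoch")
    if all(history):
        return "always_correct"
    if not any(history):
        return "unlearned"
    # Assumed forgetting event: correct at one epoch, incorrect at the next.
    if any(prev and not cur for prev, cur in zip(history, history[1:])):
        return "forgettable"
    return "other"


def transition_counts(single_histories, multi_histories):
    """Count category transitions for the same examples under two settings.

    The result maps (category_in_single_setting, category_in_multi_setting)
    to a count, which is the raw data a transition heatmap would display.
    """
    if len(single_histories) != len(multi_histories):
        raise ValueError("both settings must cover the same examples")
    return Counter(
        (categorize(single), categorize(multi))
        for single, multi in zip(single_histories, multi_histories)
    )
```

For example, `transition_counts(single, multi)` could be called with correctness histories recorded for the same examples under single-task and multi-task training. The resulting counts can then be arranged into a matrix for plotting.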
