Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling

David Grangier; Simin Fan; Skyler Seto; Pierre Ablin

arXiv:2410.03735·cs.CL·March 12, 2025

TL;DR

This paper introduces CRISP (ClusteRed Importance SamPling), a method that clusters a large generalist dataset and samples from those clusters according to their frequencies in a small specialist dataset, so that specialist language models can be trained even when domain-specific data is limited.

Contribution

The paper proposes CRISP, a scalable clustering and sampling technique that enables effective specialist model training from generalist data with limited domain-specific samples.

Findings

01. CRISP improves perplexity and accuracy on multiple tasks.

02. CRISP compares favorably with other methods that adjust the generalist training distribution using limited domain-specific data.

03. Ablation studies highlight the importance of dataset size and clustering choices.

Abstract

Specialist language models (LMs) focus on a specific task or domain on which they often outperform generalist LMs of the same size. However, the specialist data needed to pretrain these models is only available in limited amount for most tasks. In this work, we build specialist models from large generalist training sets instead. We propose a novel method, ClusteRed Importance SamPling (CRISP). CRISP clusters the generalist dataset and samples from these clusters based on their frequencies in the smaller specialist dataset. It is scalable, suitable for both pretraining and continued pretraining, and works well in multi-task settings. CRISP performs favorably compared to other methods that adjust the training distribution of the generalist data with guidance from the limited domain-specific data. Our findings demonstrate improvements across different domains in terms of language modeling…
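
The sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes every example already carries a cluster id (how the clusters are computed is not covered here), and the function name `sample_generalist_indices` and its parameters are hypothetical. A cluster is drawn with probability equal to its frequency in the specialist set, then a generalist example is drawn uniformly from that cluster.

```python
import numpy as np


def sample_generalist_indices(generalist_clusters, specialist_clusters,
                              num_clusters, num_samples, seed=0):
    """Hypothetical helper: draw generalist example indices so that the
    cluster distribution of the draws follows the specialist histogram."""
    rng = np.random.default_rng(seed)
    generalist_clusters = np.asarray(generalist_clusters)

    # Cluster histograms of the two datasets.
    spec_counts = np.bincount(specialist_clusters,
                              minlength=num_clusters).astype(float)
    gen_counts = np.bincount(generalist_clusters, minlength=num_clusters)

    # Assumption: clusters with no generalist example cannot be sampled,
    # so their specialist mass is dropped and the rest is renormalized.
    spec_counts[gen_counts == 0] = 0.0
    if spec_counts.sum() == 0:
        raise ValueError("no cluster is shared by the two datasets")
    cluster_probs = spec_counts / spec_counts.sum()

    # Probability of one example = P(its cluster) / (size of its cluster),
    # i.e. pick a cluster first, then an example uniformly inside it.
    example_probs = (cluster_probs[generalist_clusters]
                     / gen_counts[generalist_clusters])
    return rng.choice(len(generalist_clusters), size=num_samples,
                      replace=True, p=example_probs)


if __name__ == "__main__":
    generalist = [0, 0, 0, 1, 1, 2]   # cluster id of each generalist example
    specialist = [1, 1, 1, 2]         # cluster id of each specialist example
    idx = sample_generalist_indices(generalist, specialist,
                                    num_clusters=3, num_samples=8)
    print(idx)
```

In this toy setup, cluster 0 never appears in the specialist set, so generalist examples 0 to 2 are never drawn, while cluster 1 receives about three quarters of the draws.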

Videos

Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
