The interplay between domain specialization and model size

Roseval Malaquias Junior; Ramon Pires; Thales Sales Almeida; Kenzo Sakiyama; Roseli A. F. Romero; Rodrigo Nogueira

arXiv:2501.02068·cs.CL·April 1, 2025

TL;DR

This paper investigates how domain specialization during continued pretraining affects model performance and efficiency across different model sizes, revealing that larger specialized models outperform general ones with less compute and less forgetting.

Contribution

It provides insights into the optimal training regimes for domain-specific continued pretraining across various model sizes, highlighting efficiency gains and knowledge retention.

Findings

01

Specialized models outperform general models as size increases.

02

Larger specialized models require less compute for training.

03

Specialization reduces forgetting of previously learned knowledge.

Abstract

Scaling laws for language models have often focused on finding the optimal model size and token count for training from scratch. However, achieving this optimal balance requires significant compute resources due to the extensive data demands when training models from randomly-initialized weights. Continued pretraining offers a cost-effective alternative, leveraging the compute investment from pretrained models to incorporate new knowledge without requiring extensive new data. Recent findings suggest that data quality influences constants in scaling laws, thereby altering the optimal parameter-token allocation ratio. Building on this insight, we investigate the interplay between domain specialization and model size during continued pretraining under compute-constrained scenarios. Our goal is to identify an optimal training regime for this scenario and detect patterns in this interplay…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 3

Strengths

Compute Efficiency and Forgetting Analysis: Introducing the SGER (Specialized-to-General Efficiency Ratio) metric.

The Related Work section is comprehensive and well-situated in recent scaling law and domain-adaptation literature.

Weaknesses

Plots and cross-suite comparisons appear to use the test suite to pick the “minimum perplexity” checkpoint, then report those results. This is classic evaluation leakage; that minimum should be picked on a validation split that is disjoint from the reported test metrics.

Power-law claims (and the SGER vs size trend) are fit on four points without confidence intervals, fit method details, or ablations. This risks over-interpreting noise. (You do acknowledge this in Limitations, but the paper sti
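The checkpoint-selection concern raised in this review can be illustrated with a minimal sketch. It assumes each checkpoint has one perplexity measured on a validation split and one on a disjoint test split. The `Checkpoint` class, its field names, and all numbers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical record of one saved checkpoint (illustrative only)."""
    step: int
    val_ppl: float   # perplexity on a held-out validation split
    test_ppl: float  # perplexity on a disjoint test split


def select_on_validation(checkpoints):
    """Pick the checkpoint with the lowest validation perplexity.

    The test perplexity is never consulted during selection, so the
    reported test number is not biased by the selection step.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    return min(checkpoints, key=lambda c: c.val_ppl)


# Made-up numbers, chosen so that the two selection rules disagree.
checkpoints = [
    Checkpoint(step=1000, val_ppl=12.4, test_ppl=12.9),
    Checkpoint(step=2000, val_ppl=11.1, test_ppl=11.8),
    Checkpoint(step=3000, val_ppl=11.3, test_ppl=11.5),
]

best = select_on_validation(checkpoints)
# What the reviewer warns against: choosing by the reported test metric.
leaky = min(checkpoints, key=lambda c: c.test_ppl)

print(f"validation-selected step {best.step}: test ppl {best.test_ppl}")
print(f"test-selected step {leaky.step}: test ppl {leaky.test_ppl}")
```

In this toy data the test-selected checkpoint reports a lower perplexity than the validation-selected one, which is the optimistic bias the reviewer describes.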

Reviewer 02 · Rating 6 · Confidence 3

Strengths

This paper presents a compelling and timely investigation into the interplay between model size and domain specialization during continued pretraining under compute-constrained settings. Its strengths span multiple dimensions (originality, quality, clarity, and significance) and collectively position it as a valuable contribution to the field of efficient language model adaptation.

1. Originality: High – Novel Problem Formulation with Fresh Insights

The paper’s originality lies in its novel frami

Weaknesses

While the paper presents a compelling narrative with strong experimental design and significant implications, several weaknesses, though not fatal, limit the robustness, generalizability, and depth of its conclusions. Below is a detailed critique focused on specific shortcomings, supported by concrete suggestions for improvement.

1. Narrow Definition of "Specialization": Risk of Confounding Data Quality with Domain Focus

The core claim, that domain specialization improves performance under compute

Reviewer 03 · Rating 2 · Confidence 4

Strengths

(1) This paper aims to explore compute-optimal continued pretraining, a significant research direction.

Weaknesses

(1) The writing needs improvement, as the paper is difficult to understand: a) The key concept of this paper, “domain specialization”, is not well explained. The biggest confusion I had when reading this article was along which aspects a good "domain specialization" should be quantified. Is it by comparing the lowest ppl achieved under the same computational budget (i.e., 6ND)? b) In line 60 the authors first propose a hypothesis “larger models exhibit greater capacity to retain learned knowl
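The "6ND" budget mentioned in this review is the common approximation that training compute in FLOPs is roughly 6 × N × D, where N is the parameter count and D the number of training tokens. A minimal sketch of comparing models at equal compute under that approximation follows. The function names, the budget, and the model size are illustrative assumptions, not values from the paper.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute as C = 6 * N * D (in FLOPs)."""
    return 6.0 * n_params * n_tokens


def tokens_for_budget(flops_budget: float, n_params: float) -> float:
    """Tokens a model of n_params can be trained on within flops_budget."""
    if n_params <= 0:
        raise ValueError("n_params must be positive")
    return flops_budget / (6.0 * n_params)


# Hypothetical budget and model size, for illustration only.
budget = 6e18   # FLOPs
n_params = 1e9  # a 1B-parameter model

tokens = tokens_for_budget(budget, n_params)
print(f"{n_params:.0e} params -> {tokens:.2e} tokens at {budget:.0e} FLOPs")
```

Under a fixed budget, a model with twice the parameters sees half the tokens, so comparing the lowest perplexity at the same 6ND value compares models that differ in both size and data seen.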

Taxonomy

Topics: Natural Language Processing Techniques · Multi-Agent Systems and Negotiation · Artificial Intelligence in Law