Domain Pre-training Impact on Representations

Cesar Gonzalez-Gutierrez; Ariadna Quattoni

arXiv:2505.24455 · cs.CL · June 2, 2025

Open Access

TL;DR

This study investigates how the choice and composition of pre-training corpora influence the quality of transformer representations, highlighting the importance of corpus similarity and size in effective model training.

Contribution

The paper provides empirical evidence on how different pre-training corpora affect representation quality, emphasizing the role of corpus similarity and size.

Findings

01

Pre-training on small, specialized corpora can produce effective representations.

02

The benefit of combining generic and specialized corpora depends on the distributional similarity between the target task and the specialized corpus.

03

Representation quality is primarily influenced by the pre-training corpus rather than model architecture.

Abstract

This empirical study analyzes the effects of the pre-training corpus on the quality of learned transformer representations. We focus on the representation quality induced solely through pre-training. Our experiments show that pre-training on a small, specialized corpus can yield effective representations, and that the success of combining a generic and a specialized corpus depends on the distributional similarity between the target task and the specialized corpus.
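The abstract does not say how distributional similarity is measured. As an illustration only, the sketch below compares unigram token distributions with Jensen-Shannon divergence, which is one common choice and not necessarily the one used in the paper. The function names `unigram_dist` and `js_divergence` and the toy texts are all hypothetical.

```python
import math
from collections import Counter


def unigram_dist(texts):
    """Relative frequency of each lowercased whitespace token across texts."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two token distributions.

    Ranges from 0 (identical distributions) to 1 (no shared tokens).
    """
    def kl_to_mixture(a):
        # KL(a || m), where m is the equal-weight mixture of p and q.
        return sum(
            pa * math.log2(pa / (0.5 * p.get(tok, 0.0) + 0.5 * q.get(tok, 0.0)))
            for tok, pa in a.items()
            if pa > 0
        )

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


# Toy stand-ins (assumptions, not data from the paper).
task = unigram_dist(["patient dose aspirin"])
specialized = unigram_dist(["patient dose ibuprofen"])
generic = unigram_dist(["football match tonight"])

print("task vs specialized:", js_divergence(task, specialized))
print("task vs generic:    ", js_divergence(task, generic))
```

A lower divergence means the corpus is distributionally closer to the task text. Real corpora would need proper tokenization and far more data than these single-sentence examples.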

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Focus