Domain Pre-training Impact on Representations
Cesar Gonzalez-Gutierrez, Ariadna Quattoni

TL;DR
This study investigates how the choice and composition of pre-training corpora influence the quality of transformer representations, highlighting the importance of corpus similarity and size in effective model training.
Contribution
It provides empirical evidence on the impact of different pre-training corpora, emphasizing the role of corpus similarity and size in representation quality.
Findings
Pre-training on small, specialized corpora can produce effective representations.
The benefit of combining a generic and a specialized corpus depends on the distributional similarity between the target task and the specialized corpus.
Representation quality is primarily influenced by the pre-training corpus rather than model architecture.
Abstract
This empirical study analyzes the effects of the pre-training corpus on the quality of learned transformer representations. We focus on the representation quality induced solely through pre-training. Our experiments show that pre-training on a small, specialized corpus can yield effective representations, and that the success of combining a generic and a specialized corpus depends on the distributional similarity between the target task and the specialized corpus.
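The abstract does not say how distributional similarity is measured. As a rough illustration of the idea only, the sketch below compares whitespace-tokenized unigram distributions using the Jensen-Shannon divergence (base 2, so values lie between 0 and 1, and lower means more similar). The choice of measure, the function names, and the toy sentences are all assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def unigram_distribution(texts):
    """Relative frequency of each lowercased, whitespace-separated token."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two unigram distributions."""
    divergence = 0.0
    for word in set(p) | set(q):
        pw = p.get(word, 0.0)
        qw = q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # mixture distribution; positive for every word in the union
        if pw > 0:
            divergence += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            divergence += 0.5 * qw * math.log2(qw / mw)
    return divergence


# Toy, made-up data: an assumption for illustration, not the paper's corpora.
target_task = ["the patient was given aspirin"]
specialized_corpus = ["the patient received aspirin daily"]
generic_corpus = ["the match ended in a draw"]

target = unigram_distribution(target_task)
d_specialized = js_divergence(target, unigram_distribution(specialized_corpus))
d_generic = js_divergence(target, unigram_distribution(generic_corpus))

print(f"target vs specialized: {d_specialized:.3f}")
print(f"target vs generic:     {d_generic:.3f}")
```

In this toy setup the specialized corpus shares more vocabulary with the target task than the generic one, so its divergence is lower. A real comparison would need proper tokenization and much larger samples, and other similarity measures may be more appropriate.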
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Focus
