TextGram: Towards a better domain-adaptive pretraining

Sharayu Hiwarkhedkar; Saloni Mittal; Vidula Magdum; Omkar Dhekane; Raviraj Joshi; Geetanjali Kale; Arnav Ladkat

arXiv:2404.18228·cs.CL·April 30, 2024

TL;DR

TextGram introduces a novel domain-adaptive data selection method that improves pretraining efficiency and effectiveness for large language models, reducing computational costs while maintaining accuracy.

Contribution

The paper proposes TextGram, a new data selection strategy for domain-adaptive pretraining that outperforms existing methods in selecting essential data for NLP tasks.

Findings

01. TextGram improves data selection efficiency.

02. Selected data maintains model accuracy.

03. Outperforms existing selection strategies.

Abstract

For green AI, it is crucial to measure and reduce the carbon footprint emitted during the training of large language models. In NLP, performing pre-training on Transformer models requires significant computational resources. This pre-training involves using a large amount of text data to gain prior knowledge for performing downstream tasks. Thus, it is important that we select the correct data in the form of domain-specific data from this vast corpus to achieve optimum results aligned with our domain-specific tasks. While training on large unsupervised data is expensive, it can be optimized by performing a data selection step before pretraining. Selecting important data reduces the space overhead and the substantial amount of time required to pre-train the model while maintaining constant accuracy. We investigate the existing selection strategies and propose our own domain-adaptive data…
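The abstract as shown here does not describe how TextGram scores documents, so the sketch below is only a generic illustration of a data selection step run before pretraining, not the authors' method. It assumes whitespace tokenization and ranks each candidate document by the fraction of its bigrams that also occur in a small in-domain sample. All function names (`ngrams`, `build_domain_profile`, `overlap_score`, `select_top_k`) are hypothetical.

```python
from collections import Counter


def ngrams(text, n=2):
    """Return the word n-grams of a text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_domain_profile(domain_docs, n=2):
    """Count the n-grams that occur in the in-domain sample."""
    counts = Counter()
    for doc in domain_docs:
        counts.update(ngrams(doc, n))
    return counts


def overlap_score(doc, profile, n=2):
    """Fraction of the document's n-grams that appear in the domain profile."""
    grams = ngrams(doc, n)
    if not grams:
        return 0.0
    hits = sum(1 for g in grams if g in profile)
    return hits / len(grams)


def select_top_k(corpus, domain_docs, k, n=2):
    """Keep the k documents from the general corpus closest to the domain."""
    profile = build_domain_profile(domain_docs, n)
    ranked = sorted(
        corpus,
        key=lambda d: overlap_score(d, profile, n),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    domain = ["the patient was given a dose of insulin"]
    corpus = [
        "the patient was given antibiotics",
        "stock markets fell sharply today",
        "a dose of insulin was administered",
    ]
    for doc in select_top_k(corpus, domain, k=2):
        print(doc)
```

Pretraining would then run only on the selected subset, which is where the reduction in storage and training time mentioned in the abstract would come from. Whether accuracy is preserved depends on the scoring function and on `k`, neither of which is specified on this page.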

Taxonomy

Methods: Attention Is All You Need · Dropout · Residual Connection · Softmax · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Linear Layer · Dense Connections · Label Smoothing