TextGram: Towards a better domain-adaptive pretraining
Sharayu Hiwarkhedkar, Saloni Mittal, Vidula Magdum, Omkar Dhekane, Raviraj Joshi, Geetanjali Kale, Arnav Ladkat

TL;DR
TextGram is a domain-adaptive data selection method that picks the domain-relevant portion of a large corpus before pretraining, cutting the compute and storage cost of pretraining large language models while maintaining accuracy.
Contribution
The paper proposes TextGram, a new data selection strategy for domain-adaptive pretraining that outperforms existing methods in selecting essential data for NLP tasks.
Findings
TextGram improves data selection efficiency.
Selected data maintains model accuracy.
Outperforms existing selection strategies.
Abstract
For green AI, it is crucial to measure and reduce the carbon footprint emitted during the training of large language models. In NLP, performing pre-training on Transformer models requires significant computational resources. This pre-training involves using a large amount of text data to gain prior knowledge for performing downstream tasks. Thus, it is important that we select the correct data in the form of domain-specific data from this vast corpus to achieve optimum results aligned with our domain-specific tasks. While training on large unsupervised data is expensive, it can be optimized by performing a data selection step before pretraining. Selecting important data reduces the space overhead and the substantial amount of time required to pre-train the model while maintaining constant accuracy. We investigate the existing selection strategies and propose our own domain-adaptive data…
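To make the idea of a data selection step before pretraining concrete, here is a minimal sketch of one generic approach: rank corpus documents by how many of their word n-grams also appear in a small in-domain sample, then keep the top fraction. This is an illustrative assumption, not the TextGram procedure, which the abstract as shown here does not specify. The function names (`ngrams`, `overlap_score`, `select_documents`), the whitespace tokenization, and the `keep_fraction` parameter are all hypothetical.

```python
from collections import Counter


def ngrams(text, n=2):
    """Return the word n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def overlap_score(doc, domain_counts, n=2):
    """Fraction of the document's n-grams that also occur in the in-domain text."""
    grams = ngrams(doc, n)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in domain_counts) / len(grams)


def select_documents(corpus, domain_texts, keep_fraction=0.5, n=2):
    """Keep the top `keep_fraction` of corpus documents, ranked by n-gram overlap.

    Illustrative stand-in for a selection step; not the paper's method.
    """
    domain_counts = Counter(g for t in domain_texts for g in ngrams(t, n))
    ranked = sorted(
        corpus,
        key=lambda d: overlap_score(d, domain_counts, n),
        reverse=True,
    )
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]


if __name__ == "__main__":
    corpus = [
        "the patient was given a dose of insulin",
        "the match ended in a draw last night",
        "insulin dose was adjusted for the patient",
        "stock prices fell sharply on monday",
    ]
    domain_texts = ["the patient received a dose of insulin"]
    for doc in select_documents(corpus, domain_texts, keep_fraction=0.5):
        print(doc)
```

Only the selected subset would then be passed to pretraining, which is where the savings in time and storage described in the abstract would come from. A real pipeline would likely use a proper tokenizer and a more careful relevance score.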
Taxonomy
Methods: Attention Is All You Need · Dropout · Residual Connection · Softmax · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Linear Layer · Dense Connections · Label Smoothing
