INDUS: Effective and Efficient Language Models for Scientific Applications
Bishwaranjan Bhattacharjee, Aashka Trivedi, Masayasu Muraoka, Muthukumaran Ramasubramanian, Takuma Udagawa, Iksha Gurung, Nishan Pantha, Rong Zhang, Bharath Dandala, Rahul Ramachandran, Manil Maskey, Kaylin Bugbee, Mike Little, Elizabeth Fancher, Irina Gerasimov

TL;DR
INDUS introduces a suite of specialized language models trained on scientific data across multiple disciplines, outperforming general and domain-specific models on new benchmarks and demonstrating practical industrial applications.
Contribution
The paper presents new scientific domain-specific language models and benchmark datasets, tailored for Earth science, biology, physics, and related fields, with improved performance over existing models.
Findings
Models outperform RoBERTa and SciBERT on new scientific benchmarks.
Effective in retrieval and content tagging applications.
Smaller models maintain high performance with resource efficiency.
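The smaller models are described as being created with knowledge distillation. This summary does not state the exact objective, so the following is only a minimal sketch of the generic soft-label formulation, in which a student is trained to match the teacher's temperature-softened output distribution. The function names, the temperature value, and the example logits are assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Generic soft-label distillation; an assumption, not the paper's
    documented training objective.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return kl * temperature ** 2


# Hypothetical logits for a 3-way prediction.
teacher = [4.0, 1.0, 0.5]
good_student = [3.8, 1.1, 0.4]
poor_student = [0.5, 1.0, 4.0]

loss_good = distillation_loss(good_student, teacher)
loss_poor = distillation_loss(poor_student, teacher)
```

A student whose logits track the teacher's receives a loss near zero, while one that ranks the classes differently is penalized more heavily.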
Abstract
Large language models (LLMs) trained on general domain corpora have shown remarkable results on natural language processing (NLP) tasks. However, previous research demonstrated that LLMs trained using domain-focused corpora perform better on specialized tasks. Inspired by this insight, we developed INDUS, a comprehensive suite of LLMs tailored for the closely related domains of Earth science, biology, physics, heliophysics, planetary sciences and astrophysics, and trained using curated scientific corpora drawn from diverse data sources. The suite of models includes: (1) an encoder model trained using domain-specific vocabulary and corpora to address NLP tasks, (2) a contrastive-learning-based text embedding model trained using a diverse set of datasets to address information retrieval tasks, and (3) smaller versions of these models created using knowledge distillation for applications which have…
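The abstract mentions a contrastive-learning-based text embedding model for retrieval. As a rough illustration of what such an objective can look like, the sketch below computes an in-batch-negatives contrastive loss (often called InfoNCE) over toy embedding vectors: each query should score highest against its own passage and lower against the other passages in the batch. The choice of this particular loss, the temperature value, and all names here are assumptions for illustration; the abstract excerpt does not specify the paper's exact formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, passages, temperature=0.05):
    """Mean in-batch contrastive loss.

    queries[i] is paired with passages[i]; every other passage in the
    batch acts as a negative. Illustrative only, not the paper's code.
    """
    losses = []
    for i, query in enumerate(queries):
        scores = [cosine(query, p) / temperature for p in passages]
        peak = max(scores)  # log-sum-exp with max subtraction for stability
        log_denominator = peak + math.log(
            sum(math.exp(s - peak) for s in scores)
        )
        losses.append(log_denominator - scores[i])
    return sum(losses) / len(losses)


# Toy 2-D embeddings (hypothetical): aligned pairs vs. swapped pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_passages = [[1.0, 0.0], [0.0, 1.0]]
swapped_passages = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce_loss(queries, aligned_passages)
loss_swapped = info_nce_loss(queries, swapped_passages)
```

When each query embedding points the same way as its paired passage the loss is close to zero; when the pairs are swapped the loss is large, which is the signal that pulls matching texts together during training.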
Code & Models
- 🤗 nasa-impact/nasa-smd-ibm-v0.1 · model · 118 dl · ♡ 34
- 🤗 nasa-impact/nasa-smd-ibm-st · model · 4 dl · ♡ 14
- 🤗 nasa-impact/nasa-smd-ibm-st-v2 · model · 175 dl · ♡ 12
- 🤗 nasa-impact/nasa-smd-ibm-ranker · model · 5 dl · ♡ 3
- 🤗 nasa-impact/nasa-ibm-st.38m · model · 10 dl · ♡ 7
- 🤗 nasa-impact/nasa-smd-ibm-distil-v0.1 · model · 6 dl · ♡ 8
Taxonomy
Topics: Topic Modeling
Methods: Sparse Evolutionary Training · Knowledge Distillation
