INDUS: Effective and Efficient Language Models for Scientific Applications

Bishwaranjan Bhattacharjee; Aashka Trivedi; Masayasu Muraoka; Muthukumaran Ramasubramanian; Takuma Udagawa; Iksha Gurung; Nishan Pantha; Rong Zhang; Bharath Dandala; Rahul Ramachandran; Manil Maskey; Kaylin Bugbee; Mike Little; Elizabeth Fancher; Irina Gerasimov; Armin Mehrabian; Lauren Sanders; Sylvain Costes; Sergi Blanco-Cuaresma; Kelly Lockhart; Thomas Allen; Felix Grezes; Megan Ansdell; Alberto Accomazzi; Yousef El-Kurdi; Davis Wertheimer; Birgit Pfitzmann; Cesar Berrospi Ramis; Michele Dolfi; Rafael Teixeira de Lima; Panagiotis Vagenas; S. Karthik Mukkavilli; Peter Staar; Sanaz Vahidinia; Ryan McGranaghan; Tsendgar Lee

arXiv:2405.10725 · cs.CL · November 1, 2024 · 2 cites

Open Access 6 Models 4 Datasets 1 Video

TL;DR

INDUS introduces a suite of specialized language models trained on scientific data across multiple disciplines, outperforming general and domain-specific models on new benchmarks and demonstrating practical industrial applications.

Contribution

The paper presents new scientific domain-specific language models and benchmark datasets, tailored for Earth science, biology, physics, and related fields, with improved performance over existing models.

Findings

01 Models outperform RoBERTa and SCIBERT on new scientific benchmarks.

02 Effective in retrieval and content tagging applications.

03 Smaller models maintain high performance with resource efficiency.

Abstract

Large language models (LLMs) trained on general domain corpora have shown remarkable results on natural language processing (NLP) tasks. However, previous research demonstrated that LLMs trained using domain-focused corpora perform better on specialized tasks. Inspired by this insight, we developed INDUS, a comprehensive suite of LLMs tailored for the closely related domains of Earth science, biology, physics, heliophysics, planetary sciences and astrophysics, and trained using curated scientific corpora drawn from diverse data sources. The suite of models includes: (1) an encoder model trained using domain-specific vocabulary and corpora to address NLP tasks, (2) a contrastive-learning-based text embedding model trained using a diverse set of datasets to address information retrieval tasks and (3) smaller versions of these models created using knowledge distillation for applications which have…
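
To make item (2) of the abstract more concrete, here is a minimal sketch of a generic contrastive objective with in-batch negatives, the kind commonly used to train text embedding models for retrieval. It is an illustration only and is not taken from the paper: the function name `info_nce_loss`, the temperature of 0.05, and the toy two-dimensional "embeddings" are all assumptions. The abstract does not specify the exact loss INDUS uses.

```python
import math

def info_nce_loss(queries, passages, temperature=0.05):
    """Mean contrastive loss with in-batch negatives (hypothetical helper).

    queries[i] and passages[i] are assumed to be a matching pair; every
    other passage in the batch is treated as a negative for queries[i].
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    q = [normalize(v) for v in queries]
    p = [normalize(v) for v in passages]

    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i to every passage, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability assigned to the matching passage.
        total += log_denom - logits[i]
    return total / len(q)

# Toy embeddings (assumed values): each query matches its own passage.
aligned = [[1.0, 0.0], [0.0, 1.0]]
# Same passages in swapped order, so every "positive" is actually a mismatch.
shuffled = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce_loss(aligned, aligned)
loss_shuffled = info_nce_loss(aligned, shuffled)
print(loss_aligned < loss_shuffled)
```

The loss is small when each query is closest to its own passage and large when the pairs are mismatched, which is the signal that pulls matching query and passage embeddings together during training.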

Code & Models

Models

Datasets

Videos

INDUS: Effective and Efficient Language Models for Scientific Applications · underline

Taxonomy

Topics: Topic Modeling

Methods: Sparse Evolutionary Training · Knowledge Distillation