Astro-HEP-BERT: A bidirectional language model for studying the meanings of concepts in astrophysics and high energy physics
Arno Simons

TL;DR
Astro-HEP-BERT is a domain-adapted BERT model, further trained for three epochs on arXiv paragraphs from astrophysics and high-energy physics, that produces contextualized word embeddings for studying the meanings of scientific concepts with only modest additional training.
Contribution
This work introduces Astro-HEP-BERT, a specialized transformer model for scientific concepts in astrophysics and high-energy physics, showing efficient adaptation of general models for domain-specific tasks.
Findings
Comparable performance to larger domain-specific models in word sense disambiguation
Effective semantic change analysis in scientific texts
Cost-effective adaptation of general language models for scientific domains
Abstract
I present Astro-HEP-BERT, a transformer-based language model specifically designed for generating contextualized word embeddings (CWEs) to study the meanings of concepts in astrophysics and high-energy physics. Built on a general pretrained BERT model, Astro-HEP-BERT underwent further training over three epochs using the Astro-HEP Corpus, a dataset I curated from 21.84 million paragraphs extracted from more than 600,000 scholarly articles on arXiv, all belonging to at least one of these two scientific domains. The project demonstrates both the effectiveness and feasibility of adapting a bidirectional transformer for applications in the history, philosophy, and sociology of science (HPSS). The entire training process was conducted using freely available code, pretrained weights, and text inputs, and was completed on a single MacBook Pro laptop (M2/96GB). Preliminary evaluations indicate that…
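As a rough illustration of how CWEs can support word sense disambiguation, the sketch below assigns an occurrence of a term to the sense whose example embeddings it most resembles, using cosine similarity to per-sense centroids. Everything in it is an assumption made for illustration: the 3-dimensional vectors are toy stand-ins for real model outputs, the "Planck" senses and the names `cosine`, `centroid`, and `nearest_sense` are hypothetical, and nearest-centroid matching is one common approach rather than the procedure reported for Astro-HEP-BERT.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def nearest_sense(embedding, sense_examples):
    """Return the sense label whose centroid is most similar to `embedding`."""
    centroids = {sense: centroid(vecs) for sense, vecs in sense_examples.items()}
    return max(centroids, key=lambda sense: cosine(embedding, centroids[sense]))


# Toy vectors standing in for CWEs of labeled occurrences of "Planck".
# Real embeddings from a BERT-style model would have hundreds of dimensions.
SENSE_EXAMPLES = {
    "constant": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "satellite": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}

# Hypothetical embedding of a new, unlabeled occurrence.
new_occurrence = [0.7, 0.2, 0.1]
predicted = nearest_sense(new_occurrence, SENSE_EXAMPLES)
```

In practice the toy vectors would be replaced by embeddings extracted from the model for each occurrence of the target term; the comparison step itself stays the same.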
Taxonomy
Topics: Big Data Technologies and Applications · Advanced Data Processing Techniques · Computational and Text Analysis Methods
Methods: Attention Is All You Need · Linear Layer · Adam · Residual Connection · Weight Decay · Softmax · Multi-Head Attention · Dense Connections · Dropout
