Predicting New Concept-Object Associations in Astronomy by Mining the Literature

Jinchu Li; Yuan-Sen Ting; Alberto Accomazzi; Tirthankar Ghosal; Nesar Ramachandra

arXiv:2602.14335 · astro-ph.IM · April 14, 2026

TL;DR

This paper constructs a concept-object knowledge graph from the astrophysics literature and uses it to predict future concept-object associations, showing that historical graph structure can forecast new scientific links before they appear in print.

Contribution

The paper introduces an automated pipeline that extracts named astrophysical objects, resolves them to SIMBAD identifiers, and links them to scientific concepts. It shows that an implicit-feedback matrix factorization model with concept-similarity smoothing outperforms baselines at predicting new associations.

Findings

01

The ALS model with smoothing outperforms the strongest neighborhood baseline (KNN using text-embedding concept similarity) by 16.8% on NDCG@100 (0.144 vs 0.123).

02

The same model exceeds the recency heuristic by 96% on Recall@100.

03

The historical literature encodes predictive structure that goes beyond what simple heuristics capture.
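
For readers unfamiliar with the two metrics quoted in the findings, the following is a minimal sketch of binary-relevance NDCG@k and Recall@k for a single ranked list. The function names and the exact definitions are assumptions for illustration: recall is divided by the total number of relevant items, and NDCG uses a log2 discount. The paper's own evaluation code may differ in these details.

```python
import numpy as np

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top-k of a ranked list."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    gains = np.array([1.0 if item in relevant else 0.0 for item in ranked[:k]])
    # Position i (1-indexed) is discounted by 1 / log2(i + 1).
    discounts = 1.0 / np.log2(np.arange(2, gains.size + 2))
    dcg = float(np.sum(gains * discounts))
    # The ideal ranking places every relevant item first.
    ideal_hits = min(len(relevant), k)
    idcg = float(np.sum(1.0 / np.log2(np.arange(2, ideal_hits + 2))))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical example: candidate objects ranked for one concept,
# with the objects that were actually linked to it after the cutoff.
ranked_objects = ["M31", "M87", "Crab Nebula", "Sgr A*"]
new_links = {"M31", "Crab Nebula"}
print(recall_at_k(ranked_objects, new_links, k=2))
print(ndcg_at_k(ranked_objects, new_links, k=4))
```

Recall@k only counts whether held-out associations land in the top k, while NDCG@k also rewards placing them nearer the top, which is why the two metrics can show very different relative gains.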

Abstract

We construct a concept-object knowledge graph from the full astro-ph corpus through July 2025. Using an automated pipeline, we extract named astrophysical objects from OCR-processed papers, resolve them to SIMBAD identifiers, and link them to scientific concepts annotated in the source corpus. We then test whether historical graph structure can forecast new concept-object associations before they appear in print. Because the concepts are derived from clustering and therefore overlap semantically, we apply an inference-time concept-similarity smoothing step uniformly to all methods. Across four temporal cutoffs on a physically meaningful subset of concepts, an implicit-feedback matrix factorization model (alternating least squares, ALS) with smoothing outperforms the strongest neighborhood baseline (KNN using text-embedding concept similarity) by 16.8% on NDCG@100 (0.144 vs 0.123) and…
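
The abstract does not give the formula for the inference-time concept-similarity smoothing step, so the following is only a sketch of one plausible form: each concept's object scores are blended with a similarity-weighted average of the scores of other concepts. The function name `smooth_scores`, the mixing weight `alpha`, and the row normalization are all assumptions made for illustration, not the authors' method.

```python
import numpy as np

def smooth_scores(scores, concept_sim, alpha=0.5):
    """Blend each concept's object scores with those of similar concepts.

    scores:      (n_concepts, n_objects) raw scores from any base model
    concept_sim: (n_concepts, n_concepts) non-negative similarity matrix
    alpha:       hypothetical mixing weight; 0 leaves the scores unchanged
    """
    scores = np.asarray(scores, dtype=float)
    sim = np.asarray(concept_sim, dtype=float).copy()
    # Exclude self-similarity so the second term only borrows from neighbors.
    np.fill_diagonal(sim, 0.0)
    # Row-normalize so each concept's neighbor weights sum to 1 (or stay 0).
    row_sums = sim.sum(axis=1, keepdims=True)
    weights = np.divide(sim, row_sums, out=np.zeros_like(sim), where=row_sums > 0)
    return (1.0 - alpha) * scores + alpha * (weights @ scores)

# Hypothetical toy input: two overlapping concepts, two objects.
raw = np.array([[1.0, 0.0],
                [0.0, 1.0]])
sim = np.array([[1.0, 0.8],
                [0.8, 1.0]])
print(smooth_scores(raw, sim, alpha=0.5))
```

Because the step only post-processes a score matrix, it can be applied in the same way to ALS, KNN, or heuristic baselines, which matches the abstract's statement that smoothing is applied uniformly to all methods.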
