Predicting New Concept-Object Associations in Astronomy by Mining the Literature
Jinchu Li, Yuan-Sen Ting, Alberto Accomazzi, Tirthankar Ghosal, Nesar Ramachandra

TL;DR
This paper builds a concept-object knowledge graph from the astro-ph literature and shows that historical graph structure can forecast new concept-object associations before they appear in print.
Contribution
It introduces an automated pipeline for extracting and linking astrophysical concepts and objects, and shows that a matrix factorization model with smoothing outperforms baselines in predicting new associations.
Findings
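The ALS model with smoothing outperforms the strongest neighborhood baseline (KNN using text-embedding concept similarity) by 16.8% on NDCG@100 (0.144 vs 0.123).
The model exceeds the recency heuristic by 96% on Recall@100.
Historical literature encodes predictive structure beyond simple heuristics such as recency.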
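For readers unfamiliar with the two ranking metrics quoted above, the following is a minimal sketch of Recall@k and binary-relevance NDCG@k. It assumes each concept yields a ranked list of candidate objects and a set of objects that actually became associated after the cutoff. The function names and the toy object labels are hypothetical, and the paper may use a different relevance definition or averaging scheme.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    # Each hit at 0-based position i contributes 1 / log2(i + 2).
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, item in enumerate(ranked[:k])
        if item in relevant
    )
    # The ideal ranking places all relevant items first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example with hypothetical object labels (not data from the paper).
ranked = ["obj_a", "obj_b", "obj_c", "obj_d"]
relevant = {"obj_a", "obj_c"}

recall_top2 = recall_at_k(ranked, relevant, 2)
ndcg_top4 = ndcg_at_k(ranked, relevant, 4)
```

In this toy case only one of the two relevant objects is in the top 2, so Recall@2 is 0.5. NDCG@4 is below 1 because the second relevant object sits at rank 3 rather than rank 2.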
Abstract
We construct a concept-object knowledge graph from the full astro-ph corpus through July 2025. Using an automated pipeline, we extract named astrophysical objects from OCR-processed papers, resolve them to SIMBAD identifiers, and link them to scientific concepts annotated in the source corpus. We then test whether historical graph structure can forecast new concept-object associations before they appear in print. Because the concepts are derived from clustering and therefore overlap semantically, we apply an inference-time concept-similarity smoothing step uniformly to all methods. Across four temporal cutoffs on a physically meaningful subset of concepts, an implicit-feedback matrix factorization model (alternating least squares, ALS) with smoothing outperforms the strongest neighborhood baseline (KNN using text-embedding concept similarity) by 16.8% on NDCG@100 (0.144 vs 0.123) and…
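The abstract does not spell out the form of the concept-similarity smoothing step. One plausible reading, sketched below under explicit assumptions, blends each concept's per-object scores with a similarity-weighted average of the scores of all concepts. The function name `smooth_scores`, the mixing weight `alpha`, and the row-normalized weighting are illustrative choices, not the authors' formulation.

```python
def smooth_scores(scores, similarity, alpha=0.5):
    """Blend each concept's scores with a similarity-weighted average over concepts.

    scores:     list of rows, one per concept, each a list of per-object scores
    similarity: square list of lists; similarity[i][j] >= 0 for concepts i and j
    alpha:      hypothetical mixing weight in [0, 1]; 0 leaves scores unchanged
    """
    n_concepts = len(scores)
    n_objects = len(scores[0]) if scores else 0
    smoothed = []
    for i in range(n_concepts):
        total = sum(similarity[i])  # normalizer for concept i's similarity row
        row = []
        for o in range(n_objects):
            if total > 0:
                neighbor_avg = sum(
                    similarity[i][j] * scores[j][o] for j in range(n_concepts)
                ) / total
            else:
                # No similar concepts: fall back to the concept's own score.
                neighbor_avg = scores[i][o]
            row.append((1 - alpha) * scores[i][o] + alpha * neighbor_avg)
        smoothed.append(row)
    return smoothed

# Toy example: two fully overlapping concepts, two objects (hypothetical values).
raw = [[1.0, 0.0],
       [0.0, 0.0]]
sim = [[1.0, 1.0],
       [1.0, 1.0]]
result = smooth_scores(raw, sim, alpha=0.5)
```

Here the second concept, which had no evidence for the first object, inherits part of the first concept's score. Because the step operates only on a score matrix at inference time, it can be applied in the same way to the output of any method, which is consistent with the abstract's statement that smoothing is applied uniformly to all methods.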
