Triples and Knowledge-Infused Embeddings for Clustering and Classification of Scientific Documents
Mihael Arcan

TL;DR
This study evaluates the impact of structured knowledge triples on clustering and classification of scientific papers, finding that abstract-only (text-only) embeddings give the strongest classification results and that adding triples does not consistently improve performance.
Contribution
It provides a comprehensive benchmark and analysis of knowledge-infused versus text-only embeddings for scientific document organization.
Findings
Abstract-only embeddings achieve the highest classification accuracy (~0.923)
Adding triples does not consistently improve clustering or classification performance
KMeans and GMM outperform HDBSCAN in external validity metrics
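The findings above do not say which external validity metrics were used. As a minimal sketch, assuming the adjusted Rand index (ARI) is one of them, the comparison between a clustering and the known subject labels could be computed as follows. The function name `adjusted_rand_index` and the toy labels are illustrative, not taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between reference labels and cluster assignments.

    Assumption: ARI stands in for the "external validity metrics" mentioned
    in the findings; the source does not name them. A noise label such as -1
    (as produced by density-based clusterers) is treated here as one cluster.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # zero or one sample: trivially identical partitions

    # Pairs of points that share both a true class and a predicted cluster.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share a true class, and pairs that share a predicted cluster.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. everything in a single cluster
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    # Toy data only: four documents, two subjects.
    subjects = ["cs", "cs", "math", "math"]
    print(adjusted_rand_index(subjects, [1, 1, 0, 0]))  # same partition, renamed clusters
    print(adjusted_rand_index(subjects, [0, 1, 0, 1]))  # clusters cut across subjects
```

ARI is invariant to how clusters are numbered and is corrected for chance agreement, which makes it a reasonable way to compare algorithms that may return different numbers of clusters.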
Abstract
The increasing volume and complexity of scientific literature demand robust methods for organizing and understanding research documents. In this study, we investigate whether structured knowledge, specifically subject-predicate-object triples, improves clustering and classification of scientific papers. We present a modular pipeline that combines unsupervised clustering and supervised classification across four document representations: abstract, triples, abstract+triples, and hybrid. Using a filtered arXiv corpus, we evaluate four transformer embeddings (MiniLM, MPNet, SciBERT, SPECTER) with KMeans, GMM, and HDBSCAN, and then train downstream classifiers for subject prediction. Across a five-seed benchmark (seeds 40-44), abstract-only inputs provide the strongest and most stable classification performance, reaching 0.923 accuracy and 0.923 macro-F1 (mean). Triple-only and…
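The abstract reports accuracy and macro-F1 as means over five seeds (40-44). The sketch below shows one way such per-seed scores could be computed and averaged. The helper names (`accuracy`, `macro_f1`, `summarize_runs`) and the toy predictions are hypothetical and do not reproduce the paper's data or code.

```python
from statistics import mean


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted subject matches the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare subjects count as much as common ones."""
    f1_scores = []
    for label in set(y_true) | set(y_pred):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return mean(f1_scores)


def summarize_runs(runs):
    """Average both metrics over seeds; `runs` maps seed -> (y_true, y_pred)."""
    return {
        "accuracy_mean": mean(accuracy(t, p) for t, p in runs.values()),
        "macro_f1_mean": mean(macro_f1(t, p) for t, p in runs.values()),
    }


if __name__ == "__main__":
    # Toy labels only; a real run would use held-out predictions per seed.
    y_true = ["cs", "cs", "math", "math"]
    toy_runs = {
        40: (y_true, ["cs", "cs", "math", "math"]),
        41: (y_true, ["cs", "math", "math", "math"]),
        42: (y_true, ["cs", "cs", "math", "cs"]),
        43: (y_true, ["cs", "cs", "math", "math"]),
        44: (y_true, ["math", "cs", "math", "math"]),
    }
    print(summarize_runs(toy_runs))
```

Reporting the mean over several seeds, as the abstract does, reduces the chance that a single favorable split or initialization drives the comparison between representations.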
