Triples and Knowledge-Infused Embeddings for Clustering and Classification of Scientific Documents

Mihael Arcan

arXiv:2601.08841·cs.CL·April 21, 2026

TL;DR

This study evaluates whether structured knowledge triples improve clustering and classification of scientific papers. It finds that abstract-only embeddings give the strongest and most stable classification results, and that adding triples does not consistently help.

Contribution

The paper benchmarks knowledge-infused against text-only embeddings for scientific document organization, covering four document representations, four transformer embeddings, and three clustering algorithms.

Findings

01

Abstract-only embeddings achieve the highest classification accuracy (~0.923)

02

Adding triples does not consistently improve clustering or classification performance

03

KMeans and GMM outperform HDBSCAN on external validity metrics
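External validity metrics score a clustering against known labels, here the papers' subject categories. This summary does not say which metrics the paper uses, so the sketch below uses cluster purity purely as a simple stand-in; the function name `purity` and the toy labels are illustrative assumptions, not the paper's code or data.

```python
from collections import Counter


def purity(cluster_ids, true_labels):
    """Share of documents that carry the majority label of their cluster.

    Illustrative stand-in for an external validity metric; the paper's
    actual metrics are not named in this summary.
    """
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have equal length")
    if not cluster_ids:
        raise ValueError("inputs must be non-empty")

    # Count how often each true label occurs inside each cluster.
    label_counts_by_cluster = {}
    for cluster_id, label in zip(cluster_ids, true_labels):
        label_counts_by_cluster.setdefault(cluster_id, Counter())[label] += 1

    # Each cluster contributes the size of its most frequent label.
    majority_total = sum(
        max(counts.values()) for counts in label_counts_by_cluster.values()
    )
    return majority_total / len(true_labels)


# Toy data (made up): two clusters over six documents.
clusters = [0, 0, 0, 1, 1, 1]
subjects = ["cs.CL", "cs.CL", "cs.LG", "math.ST", "math.ST", "cs.LG"]
toy_purity = purity(clusters, subjects)  # 4 of 6 documents match their cluster majority
```

Purity rewards many small clusters, so it is usually read alongside other metrics. Density-based methods such as HDBSCAN can also leave points unassigned as noise; this sketch treats whatever ID those points carry as one more cluster.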

Abstract

The increasing volume and complexity of scientific literature demand robust methods for organizing and understanding research documents. In this study, we investigate whether structured knowledge, specifically subject-predicate-object triples, improves clustering and classification of scientific papers. We present a modular pipeline that combines unsupervised clustering and supervised classification across four document representations: abstract, triples, abstract+triples, and hybrid. Using a filtered arXiv corpus, we evaluate four transformer embeddings (MiniLM, MPNet, SciBERT, SPECTER) with KMeans, GMM, and HDBSCAN, and then train downstream classifiers for subject prediction. Across a five-seed benchmark (seeds 40-44), abstract-only inputs provide the strongest and most stable classification performance, reaching 0.923 accuracy and 0.923 macro-F1 (mean). Triple-only and…
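The abstract reports accuracy and macro-F1, each averaged over the five seeds. As a minimal sketch of what those two per-run scores measure, the pure-Python functions below compute them from predicted and true subject labels. The names `accuracy` and `macro_f1` and the toy labels are assumptions for illustration; they are not taken from the paper's pipeline.

```python
def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted subject equals the true subject."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare subjects count as much as common ones."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    per_class_f1 = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        denominator = 2 * tp + fp + fn
        per_class_f1.append(2 * tp / denominator if denominator else 0.0)

    return sum(per_class_f1) / len(per_class_f1)


# Toy labels (made up), not results from the paper.
y_true = ["cs.CL", "cs.CL", "cs.LG", "cs.LG"]
y_pred = ["cs.CL", "cs.LG", "cs.LG", "cs.LG"]
toy_accuracy = accuracy(y_true, y_pred)
toy_macro_f1 = macro_f1(y_true, y_pred)
```

When the classes are roughly balanced and errors are spread evenly, accuracy and macro-F1 tend to land close together, which is consistent with the two reported means both being 0.923.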
