Topic Is Not Agenda: A Citation-Community Audit of Text Embeddings
Junseon Yoo

TL;DR
This paper shows that current text embeddings often fail to capture research agendas within scientific literature, so retrieved papers are frequently on-topic but off-agenda, and it proposes citation-based signals as a more reliable alternative.
Contribution
It provides a large-scale analysis of embedding limitations in scientific retrieval and introduces citation graph signals as a diagnostic tool for agenda alignment.
Findings
Embeddings achieve a 45-52% top-10 same-community rate at the broad sub-field level (L1).
The rate drops to 15-21% at the finer research-agenda level (L2).
Citation-based reranking significantly improves agenda matching accuracy.
Abstract
Vector search and retrieval-augmented generation (RAG) rest on the assumption that cosine similarity between text embeddings reflects conceptual relatedness. We measure where this assumption breaks. We build an augmented citation graph over 3.58M scientific papers and partition it via Leiden CPM at two granularities: sub-field (L1) and research-agenda (L2, hierarchical inside each L1). Four state-of-the-art embeddings (Gemini, Qwen3-8B, Qwen3-0.6B, SPECTER2) clear the L1 bar reasonably (45-52% top-10 same-rate) but stop working at L2: only 15-21% of top-10 neighbors share the query's research agenda. In absolute terms, 8 of every 10 retrieved papers are off-agenda. The failure is universal across eight scientific domains and all four models; SPECTER2, despite its citation-based contrastive training, is the weakest. As a diagnostic probe, we test whether the same augmented graph also…
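The "top-10 same-rate" reported in the abstract can be illustrated with a minimal sketch. It assumes the metric is the mean fraction of each query's k nearest neighbors (by cosine similarity) that carry the same community label as the query; the abstract does not spell out the exact definition, so this reading is an assumption. The function names, the toy vectors, and the labels below are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_same_rate(embeddings, labels, k=10):
    """Mean fraction of each query's k nearest neighbors sharing its label.

    Assumed reading of the abstract's "top-10 same-rate"; the query itself
    is excluded from its own neighbor list.
    """
    n = len(embeddings)
    rates = []
    for q in range(n):
        scored = [
            (cosine(embeddings[q], embeddings[j]), j)
            for j in range(n)
            if j != q
        ]
        scored.sort(key=lambda s: s[0], reverse=True)
        top = scored[:k]
        same = sum(1 for _, j in top if labels[j] == labels[q])
        rates.append(same / len(top))
    return sum(rates) / n


# Hypothetical toy data: two sub-fields (L1), each split into two agendas (L2).
embeddings = [
    [1.0, 0.0], [0.9, 0.1], [0.8, 0.3],
    [0.0, 1.0], [0.1, 0.9], [0.3, 0.8],
]
l1_labels = ["A", "A", "A", "B", "B", "B"]
l2_labels = ["A1", "A1", "A2", "B1", "B1", "B2"]

l1_rate = top_k_same_rate(embeddings, l1_labels, k=2)
l2_rate = top_k_same_rate(embeddings, l2_labels, k=2)
print(f"L1 same-rate: {l1_rate:.3f}")  # 1.000 on this toy data
print(f"L2 same-rate: {l2_rate:.3f}")  # 0.333 on this toy data
```

On this toy data every neighbor stays inside the query's sub-field, yet only a third share its agenda, which mirrors in miniature the L1-versus-L2 gap the abstract describes. A brute-force scan like this is only practical for small samples; at the scale of millions of papers an approximate nearest-neighbor index would be needed.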
