Topic Is Not Agenda: A Citation-Community Audit of Text Embeddings

Junseon Yoo

arXiv:2605.07158 · cs.IR · May 11, 2026

TL;DR

This paper shows that current text embeddings often fail to capture research agendas in scientific literature, so most retrieved neighbors are on-topic but off-agenda, and it proposes citation-based signals as a more reliable alternative.

Contribution

It provides a large-scale analysis of embedding limitations in scientific retrieval and introduces citation graph signals as a diagnostic tool for agenda alignment.

Findings

01

At the broad sub-field level (L1), 45-52% of a query's top-10 nearest neighbors fall in the same sub-field.

02

At the finer research-agenda level (L2), only 15-21% of top-10 neighbors share the query's research agenda.

03

Citation-based reranking significantly improves agenda matching accuracy.

Abstract

Vector search and retrieval-augmented generation (RAG) rest on the assumption that cosine similarity between text embeddings reflects conceptual relatedness. We measure where this assumption breaks. We build an augmented citation graph over 3.58M scientific papers and partition it via Leiden CPM at two granularities: sub-field (L1) and research-agenda (L2, hierarchical inside each L1). Four state-of-the-art embeddings (Gemini, Qwen3-8B, Qwen3-0.6B, SPECTER2) clear the L1 bar reasonably (45-52% top-10 same-rate) but stop working at L2: only 15-21% of top-10 neighbors share the query's research agenda. In absolute terms, 8 of every 10 retrieved papers are off-agenda. The failure is universal across eight scientific domains and all four models; SPECTER2, despite its citation-based contrastive training, is the weakest. As a diagnostic probe, we test whether the same augmented graph also…
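The top-10 same-community rate described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's evaluation code: the function names `cosine` and `same_community_rate`, the toy vectors, and the label lists are all hypothetical, and a brute-force scan stands in for whatever nearest-neighbor index the paper uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def same_community_rate(embeddings, labels, k=10):
    """Mean fraction of each query's top-k cosine neighbors that share
    the query's community label (hypothetical helper, brute force)."""
    n = len(embeddings)
    rates = []
    for q in range(n):
        # Score every other paper against the query, excluding the query itself.
        scored = [(cosine(embeddings[q], embeddings[j]), j)
                  for j in range(n) if j != q]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        top = scored[:k]
        if not top:
            continue
        hits = sum(1 for _, j in top if labels[j] == labels[q])
        rates.append(hits / len(top))
    return sum(rates) / len(rates) if rates else 0.0


# Toy data (assumed): four papers in two coarse clusters.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
l1_labels = ["a", "a", "b", "b"]        # coarse sub-field labels
l2_labels = ["a1", "a2", "b1", "b1"]    # finer agenda labels inside each sub-field

l1_rate = same_community_rate(embeddings, l1_labels, k=1)
l2_rate = same_community_rate(embeddings, l2_labels, k=1)
print(l1_rate, l2_rate)
```

On this toy input the nearest neighbor always shares the coarse label, but two of the four queries retrieve a paper from a different fine-grained community, which mirrors in miniature the L1-versus-L2 gap the abstract reports.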
