Structurally Human, Semantically Biased: Detecting LLM-Generated References with Embeddings and GNNs
Melika Mobini, Vincent Holst, Floriano Tori, Andres Algaba, Vincent Ginis

TL;DR
This paper demonstrates that LLM-generated bibliographies closely mimic human citation networks in structure but can be distinguished by their semantic content using embeddings and GNNs, enabling detection of AI-generated references.
Contribution
The study introduces a novel method combining embeddings and graph neural networks to effectively detect LLM-generated references based on semantic content, surpassing structure-only approaches.
Findings
Embeddings significantly improve detection accuracy over structure alone.
GNNs with embeddings achieve 93% accuracy in distinguishing GPT from human references.
Semantic fingerprints enable reliable detection of AI-generated bibliographies.
Abstract
Large language models are increasingly used to curate bibliographies, raising the question: are their reference lists distinguishable from human ones? We build paired citation graphs, ground truth and GPT-4o-generated (from parametric knowledge), for 10,000 focal papers (~275k references) from SciSciNet, and add a field-matched random baseline that preserves out-degree and field distributions while breaking latent structure. We compare (i) structure-only node features (degree/closeness/eigenvector centrality, clustering, edge count) with (ii) 3072-D title/abstract embeddings, using a Random Forest (RF) on graph-level aggregates and Graph Neural Networks (GNNs) with node features. Structure alone barely separates GPT from ground truth (RF accuracy ~0.60) despite cleanly rejecting the random baseline (~0.89–0.92). By contrast, embeddings sharply increase separability: RF on…
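The structure-only feature set named in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' pipeline: it assumes an undirected view of the citation graph, uses a plain mean as the graph-level aggregate, and leaves out eigenvector centrality for brevity. The function names and the toy edge list are hypothetical.

```python
from collections import deque
from statistics import mean


def build_adjacency(edges):
    """Build undirected adjacency sets from (u, v) pairs (assumption: undirected)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def closeness(adj, source):
    """Closeness = reachable nodes / sum of shortest-path distances (via BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0


def clustering(adj, node):
    """Local clustering: fraction of neighbour pairs that are themselves linked."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in adj[nbrs[i]]
    )
    return 2.0 * links / (k * (k - 1))


def graph_features(edges):
    """Aggregate node-level structure into one feature row per graph (mean only)."""
    adj = build_adjacency(edges)
    return {
        "n_edges": len({frozenset(e) for e in edges}),
        "mean_degree": mean(len(adj[v]) for v in adj),
        "mean_closeness": mean(closeness(adj, v) for v in adj),
        "mean_clustering": mean(clustering(adj, v) for v in adj),
    }


# Toy graph: a triangle (a, b, c) with one pendant reference d attached to c.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
feats = graph_features(toy_edges)
print(feats)
```

One such row per graph (ground truth, GPT-generated, or random baseline) could then be passed to any tabular classifier. The embedding-based variant described in the abstract would instead aggregate the 3072-D node vectors.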
Peer Reviews
Decision: ICLR 2026 Poster
The following are 3 key strong points of this paper:
1. The authors carried out a rigorous experimental design, using paired graphs per focal paper and applying field-matched randomization to break subtle citation structure, which makes the conclusions robust. This is one of the strongest points of this paper.
2. The authors have done a good job of clearly decomposing structural vs. semantic signals. This leads to better interpretability and diagnostic insights.
3. Their experiments provide strong empirical evid
1. The analysis carried out in this paper is limited to GPT-4o. Concept drift/divergence between models is noted but not tested. This is a serious limitation of this paper.
2. While semantic embeddings separate the classes, the authors do not identify which semantic dimensions differ (recency, methodology, prestige, jargon).
3. For GNNs, only titles and simple metrics are used. Full-text content or citation contexts might yield deeper insights. Is there any reason for this?
1. The findings of the paper are timely, as LLM-generated scientific bibliographies are emerging. Knowing the bias in the generated citations would be helpful to all researchers (and reviewers).
2. The paper provides firm empirical grounding. Not only does the paper present the accuracy results, but it also includes descriptive statistics and visualizations to confirm/motivate the results (Figures 2 and 3).
3. The paper presents a carefully controlled dataset of ~9k citation graphs across three c
1. The paper focuses on "parametric knowledge" and GPT-4o. This setting might not reflect the behavior of all LLMs, or of citation recommendation systems that increasingly perform "deep research".
2. While the paper argues that semantic embeddings reveal the key differences between human and GPT-generated citation graphs, this claim is potentially confounded by the extreme disparity in feature dimensionality (3072 vs 5). The observed accuracy gain in the Random Forest might partly reflect model capacit
### Strengths
- **Originality:** This paper addresses the novel problem of distinguishing LLM-generated reference lists from human ones by constructing citation graphs and developing three classification approaches. While citation graphs and Random Forests are established methods, their application to this specific problem represents a new use case.
- **Quality:** The study demonstrates a solid experimental setup by evaluating three distinct classification approaches across multiple mo
### Weaknesses
- **Related Work:** The paper lacks a dedicated related work section and does not fully contextualize its contribution. The authors briefly mention prior studies by Algaba et al. (2024), Mobini et al. (2025), and Algaba et al. (2025), but do not clarify the distinct contributions of each. A more thorough discussion is needed, especially regarding connections to related areas such as LLM-generated content detection, citation network analysis, citation generation, citation reco
Taxonomy
Topics: Advanced Graph Neural Networks · Topic Modeling · Biomedical Text Mining and Ontologies
