HLM-Cite: Hybrid Language Model Workflow for Text-based Scientific Citation Prediction
Qianyue Hao, Jingyang Fan, Fengli Xu, Jian Yuan, Yong Li

TL;DR
This paper introduces HLM-Cite, a hybrid workflow that combines embedding and generative language models to predict core citations in scientific papers. It targets two challenges: the scale of the candidate pool and the implicit logical relationships between papers.
Contribution
The paper introduces the concept of core citations, which separates critical references from superficial mentions, and proposes a hybrid LLM workflow to predict them.
Findings
Achieved 17.6% improvement over state-of-the-art methods.
Scalable to 100K candidate papers.
Effective across 19 scientific fields.
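The two-stage idea behind a hybrid workflow of this kind (an embedding step narrows a large candidate pool, then a generative model ranks the short list) can be sketched as below. This is a minimal illustration, not the authors' implementation: the bag-of-words `embed` function stands in for a learned embedding model, `placeholder_llm_score` stands in for a prompted LLM, and all function names and sample texts are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a learned embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, candidates, k):
    """Stage 1: keep the k candidates most similar to the query."""
    q = embed(query)
    ranked = sorted(candidates, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def rerank(query, shortlist, score_fn):
    """Stage 2: reorder the short list with a (hypothetical) generative-model scorer."""
    return sorted(shortlist, key=lambda c: score_fn(query, c), reverse=True)


def placeholder_llm_score(query, candidate):
    # Placeholder only: a real system would prompt an LLM here.
    # It counts shared distinct words so the example runs offline.
    return len(set(query.lower().split()) & set(candidate.lower().split()))


candidates = [
    "graph neural networks for citation networks",
    "protein folding with deep learning",
    "language models for scientific citation prediction",
    "a survey of medieval poetry",
]
query = "predicting scientific citation using language models"

shortlist = retrieve(query, candidates, k=2)
result = rerank(query, shortlist, placeholder_llm_score)
print(result)
```

Only the short list reaches the second stage, so the expensive generative step never has to fit the full candidate pool into its context window.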
Abstract
Citation networks are critical in modern science, and predicting which previous papers (candidates) a new paper (query) will cite is an important problem. However, the roles of a paper's citations vary significantly, ranging from foundational knowledge bases to superficial context. Distinguishing these roles requires a deeper understanding of the logical relationships among papers, beyond simple edges in citation networks. The emergence of LLMs with textual reasoning capabilities offers new possibilities for discerning these relationships, but there are two major challenges. First, in practice, a new paper may select its citations from a vast pool of existing papers, whose combined text exceeds the context length of LLMs. Second, logical relationships between papers are implicit, and directly prompting an LLM to predict citations may result in surface-level textual similarities rather than the…
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Topic Modeling
