HLM-Cite: Hybrid Language Model Workflow for Text-based Scientific Citation Prediction

Qianyue Hao; Jingyang Fan; Fengli Xu; Jian Yuan; Yong Li

arXiv:2410.09112 · cs.DL · October 15, 2024

TL;DR

This paper introduces HLM-Cite, a hybrid workflow leveraging embedding and generative language models to predict core citations in scientific papers, addressing challenges of scale and implicit logical relationships, with significant performance gains.

Contribution

The paper proposes a hybrid language model workflow and introduces the concept of core citations, which separates critical references from superficial mentions, to improve citation prediction.

Findings

01. Achieved 17.6% improvement over state-of-the-art methods.

02. Scalable to 100K candidate papers.

03. Effective across 19 scientific fields.

Abstract

Citation networks are critical in modern science, and predicting which previous papers (candidates) a new paper (query) will cite is a critical problem. However, the roles of a paper's citations vary significantly, ranging from foundational knowledge basis to superficial contexts. Distinguishing these roles requires a deeper understanding of the logical relationships among papers, beyond simple edges in citation networks. The emergence of LLMs with textual reasoning capabilities offers new possibilities for discerning these relationships, but there are two major challenges. First, in practice, a new paper may select its citations from a gigantic set of existing papers, whose texts exceed the context length of LLMs. Second, logical relationships between papers are implicit, and directly prompting an LLM to predict citations may result in surface-level textual similarities rather than the…
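
The first challenge named in the abstract (a candidate set too large for an LLM's context) is commonly handled by shortlisting candidates with embeddings before a generative model sees them. The sketch below illustrates that general retrieve-then-rank layout only; it is an assumption about how such a hybrid workflow can be arranged, not the authors' implementation. The names `embed`, `cosine`, `retrieve`, and `rerank` are hypothetical, `embed` is a toy bag-of-words stand-in for a real embedding model, and `rerank` is a placeholder where a generative model would reorder the shortlist.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model (assumption): bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, candidates, k):
    """Stage 1: keep the k candidates most similar to the query."""
    q = embed(query)
    ranked = sorted(candidates, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def rerank(query, shortlist):
    """Stage 2 placeholder (assumption): a generative model would reorder
    the shortlist here; this stub returns it unchanged."""
    return list(shortlist)


if __name__ == "__main__":
    papers = [
        "citation prediction with language models",
        "protein folding dynamics",
        "graph neural networks for citation networks",
    ]
    query = "language models for citation prediction"
    print(rerank(query, retrieve(query, papers, k=2)))
```

Only the shortlist from `retrieve` reaches the second stage, so the text passed to the generative model grows with `k` rather than with the size of the full candidate pool.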

Code & Models

Repositories

tsinghua-fib-lab/H-LM · PyTorch · Official

Videos

HLM-Cite: Hybrid Language Model Workflow for Text-based Scientific Citation Prediction · SlidesLive

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Topic Modeling