RAGRank: Using PageRank to Counter Poisoning in CTI LLM Pipelines
Austin Jia, Avaneesh Ramesh, Zain Shamsi, Daniel Zhang, and Alex Liu

TL;DR
This paper introduces RAGRank, a method that enhances retrieval-augmented generation systems in cyber threat intelligence by applying PageRank to identify and prioritize credible sources, thereby mitigating poisoning attacks.
Contribution
The paper proposes using PageRank-based source credibility scoring to improve the robustness of RAG systems against poisoning in CTI contexts, demonstrating effectiveness on standard and CTI datasets.
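As a rough illustration of the idea, the sketch below runs a plain power-iteration PageRank over a toy link graph. It is not the authors' implementation: this summary does not say how RAGRank builds its graph, so the document names, the citation edges, and the `pagerank` helper are all hypothetical, and the damping factor of 0.85 is simply the conventional default.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    links: dict mapping each document to the list of documents it cites.
    Returns a dict mapping each document to its authority score.
    """
    # Collect every node that appears as a source or as a target.
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    # Start from a uniform distribution.
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n
            for node in nodes
        }
        # Each source splits its damped rank evenly among the documents it cites.
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy corpus (assumption, not data from the paper).
# The poisoned document can cite trusted sources, but nothing cites it back.
links = {
    "vendor_advisory": ["cert_bulletin"],
    "cert_bulletin": ["vendor_advisory"],
    "analyst_blog": ["vendor_advisory", "cert_bulletin"],
    "poisoned_doc": ["vendor_advisory"],
}

scores = pagerank(links)
for doc, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{doc}: {score:.4f}")
```

In this toy graph the poisoned document receives no inbound links, so its score stays at the baseline while the mutually cited sources accumulate authority. A retrieval pipeline could, under the same assumptions, use such scores to down-weight low-authority documents before they reach the LLM.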
Findings
PageRank reduces the influence of malicious documents
Improves trustworthiness of retrieved content
Effective on CTI-specific data
Abstract
Retrieval-Augmented Generation (RAG) has emerged as the dominant architectural pattern to operationalize Large Language Model (LLM) usage in Cyber Threat Intelligence (CTI) systems. However, this design is susceptible to poisoning attacks, and previously proposed defenses can fail for CTI contexts as cyber threat information is often completely new for emerging attacks, and sophisticated threat actors can mimic legitimate formats, terminology, and stylistic conventions. To address this issue, we propose that the robustness of modern RAG defenses can be accelerated by applying source credibility algorithms on corpora, using PageRank as an example. In our experiments, we demonstrate quantitatively that our algorithm applies a lower authority score to malicious documents while promoting trusted content, using the standardized MS MARCO dataset. We also demonstrate proof-of-concept…
Taxonomy
Topics: Misinformation and Its Impacts · Authorship Attribution and Profiling · Hate Speech and Cyberbullying Detection
