SubGCache: Accelerating Graph-based RAG with Subgraph-level KV Cache

Qiuyu Zhu; Liang Zhang; Qianxiong Xu; Cheng Long; Jie Zhang

arXiv:2505.10951 · cs.LG · May 20, 2025

TL;DR

SubGCache accelerates graph-based retrieval-augmented generation by reusing pre-computed key-value (KV) caches across queries that retrieve similar subgraphs, cutting time-to-first-token by up to 6.68× while maintaining or improving output quality.

Contribution

We introduce SubGCache, a method that clusters queries by subgraph similarity and reuses the pre-computed KV cache of each cluster's representative subgraph to speed up graph-based RAG while maintaining or improving generation quality.

Findings

01 Up to 6.68× reduction in time-to-first-token.

02 Consistent latency improvements across multiple datasets and models.

03 Maintains or improves generation quality.

Abstract

Graph-based retrieval-augmented generation (RAG) enables large language models (LLMs) to incorporate structured knowledge via graph retrieval as contextual input, supporting more accurate and context-aware reasoning. We observe that different queries can retrieve similar subgraphs as prompts, and we therefore propose SubGCache, which reduces inference latency by reusing computation across queries with structurally similar prompts (i.e., subgraphs). Specifically, SubGCache clusters queries based on subgraph embeddings, constructs a representative subgraph for each cluster, and pre-computes the key-value (KV) cache of that representative subgraph. Each query whose retrieved subgraph falls within a cluster then reuses the pre-computed KV cache of the cluster's representative subgraph instead of recomputing the KV tensors. Experiments on two new datasets…
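The three steps named in the abstract (cluster queries by subgraph embedding, build a representative subgraph per cluster, pre-compute its KV cache once and reuse it) can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the bag-of-symbols embedding, the greedy cosine-threshold clustering, the use of a union of triples as the representative subgraph, and the `FakeLLM` class, which only counts prefill passes instead of producing real KV tensors.

```python
import math
from collections import Counter

# Hypothetical toy data: each query id maps to the set of
# (head, relation, tail) triples a graph retriever returned for it.
retrieved = {
    "q1": {("Alice", "works_at", "Acme"), ("Acme", "based_in", "Paris")},
    "q2": {("Alice", "works_at", "Acme"), ("Acme", "based_in", "Paris"),
           ("Alice", "knows", "Bob")},
    "q3": {("Mars", "orbits", "Sun"), ("Mars", "has_moon", "Phobos")},
}

def embed(subgraph):
    """Toy subgraph embedding (assumption): counts of entities and relations."""
    counts = Counter()
    for head, rel, tail in subgraph:
        counts.update([head, rel, tail])
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster_queries(retrieved, threshold=0.5):
    """Greedy clustering (assumption): join the first cluster whose seed
    embedding is at least `threshold` similar, else start a new cluster."""
    clusters = []  # list of (seed_embedding, [query ids])
    for qid in sorted(retrieved):
        emb = embed(retrieved[qid])
        for seed, members in clusters:
            if cosine(seed, emb) >= threshold:
                members.append(qid)
                break
        else:
            clusters.append((emb, [qid]))
    return [members for _, members in clusters]

def representative_subgraph(members, retrieved):
    """Representative subgraph as the union of member subgraphs (assumption)."""
    rep = set()
    for qid in members:
        rep |= retrieved[qid]
    return rep

def serialize(subgraph):
    """Deterministic textual form of a subgraph, used as the shared prefix."""
    return "\n".join(f"{h} {r} {t}" for h, r, t in sorted(subgraph))

class FakeLLM:
    """Stand-in for a real model: counts how often a prefix is prefilled."""
    def __init__(self):
        self.prefill_calls = 0

    def prefill(self, prefix_text):
        self.prefill_calls += 1
        return ("kv", prefix_text)  # placeholder for real KV tensors

    def answer(self, kv_cache, question):
        return f"answer to {question} from a {len(kv_cache[1])}-char cached prefix"

def run(retrieved, llm, threshold=0.5):
    outputs = {}
    clusters = cluster_queries(retrieved, threshold)
    for members in clusters:
        rep = representative_subgraph(members, retrieved)
        kv = llm.prefill(serialize(rep))  # computed once per cluster
        for qid in members:
            outputs[qid] = llm.answer(kv, qid)  # reused by every member
    return clusters, outputs

llm = FakeLLM()
clusters, outputs = run(retrieved, llm)
print(clusters)           # [['q1', 'q2'], ['q3']]
print(llm.prefill_calls)  # 2
```

With this toy data, three queries need only two prefill passes, because `q1` and `q2` share one representative prefix. In a real system the saving would come from skipping the prefill over the shared subgraph tokens, which is the part of inference that time-to-first-token measures.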

Taxonomy

Topics: Advanced Graph Neural Networks · Topic Modeling · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Layer Normalization · Softmax · Attention Dropout · WordPiece · Residual Connection · Linear Layer · Byte Pair Encoding