SubGCache: Accelerating Graph-based RAG with Subgraph-level KV Cache
Qiuyu Zhu, Liang Zhang, Qianxiong Xu, Cheng Long, Jie Zhang

TL;DR
SubGCache accelerates graph-based retrieval-augmented generation by reusing pre-computed key-value caches for similar subgraphs, significantly reducing inference latency while maintaining or improving output quality.
Contribution
We introduce SubGCache, a novel method that clusters queries by subgraph similarity and reuses pre-computed caches to speed up graph-based RAG with minimal quality loss.
Findings
Up to 6.68× reduction in time-to-first-token.
Consistent latency improvements across multiple datasets and models.
Maintains or improves generation quality.
Abstract
Graph-based retrieval-augmented generation (RAG) enables large language models (LLMs) to incorporate structured knowledge via graph retrieval as contextual input, supporting more accurate and context-aware reasoning. We observe that different queries can retrieve similar subgraphs as prompts, and we therefore propose SubGCache, which aims to reduce inference latency by reusing computation across queries with structurally similar prompts (i.e., subgraphs). Specifically, SubGCache clusters queries based on subgraph embeddings, constructs a representative subgraph for each cluster, and pre-computes the key-value (KV) cache of the representative subgraph. For each query in a cluster, it reuses the pre-computed KV cache of the cluster's representative subgraph instead of recomputing the KV tensors, which saves computation. Experiments on two new datasets…
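The pipeline described in the abstract (cluster queries by subgraph embedding, build one representative subgraph per cluster, pre-compute its cache once, reuse it for every query in the cluster) can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation. The abstract does not say which clustering algorithm is used or how the representative subgraph is constructed, so the greedy cosine-threshold clustering, the edge-union representative, and the `fake_prefill` stand-in for an LLM prefill pass are all assumptions, and every function name here is hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cluster_queries(embeddings, threshold=0.9):
    """ASSUMPTION: greedy clustering. A query joins the first cluster whose
    seed embedding is at least `threshold` similar, else starts a new one."""
    seeds, clusters = [], []
    for i, emb in enumerate(embeddings):
        for c, seed in enumerate(seeds):
            if cosine(emb, seed) >= threshold:
                clusters[c].append(i)
                break
        else:
            seeds.append(emb)
            clusters.append([i])
    return clusters


def representative_subgraph(subgraphs, members):
    """ASSUMPTION: the representative is the union of the members' edges."""
    rep = set()
    for i in members:
        rep |= subgraphs[i]
    return rep


def fake_prefill(subgraph):
    """Hypothetical stand-in for the LLM prefill that would produce KV tensors."""
    return {"num_edges": len(subgraph), "edges": tuple(sorted(subgraph))}


def build_cluster_caches(embeddings, subgraphs, threshold=0.9):
    """Run the prefill once per cluster and map each query to its cluster."""
    caches, assignment = {}, {}
    for cid, members in enumerate(cluster_queries(embeddings, threshold)):
        caches[cid] = fake_prefill(representative_subgraph(subgraphs, members))
        for i in members:
            assignment[i] = cid
    return caches, assignment


# Toy data: queries 0 and 1 retrieve overlapping subgraphs, query 2 does not.
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
subgraphs = [
    {("a", "r", "b")},
    {("a", "r", "b"), ("b", "r", "c")},
    {("x", "r", "y")},
]
caches, assignment = build_cluster_caches(embeddings, subgraphs)
```

In this toy run, three queries trigger only two prefill passes, because queries 0 and 1 share the cache built for their cluster. In a real system the cached object would hold the model's key and value tensors for the representative subgraph's prompt prefix, and each query would only process its own question tokens on top of that prefix.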
Taxonomy
Topics: Advanced Graph Neural Networks · Topic Modeling · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Layer Normalization · Softmax · Attention Dropout · WordPiece · Residual Connection · Linear Layer · Byte Pair Encoding
