KG-CQR: Leveraging Structured Relation Representations in Knowledge Graphs for Contextual Query Retrieval
Chi Minh Bui, Ngoc Mai Thieu, Van Vinh Nguyen, Jason J. Jung, Khac-Hoai Nam Bui

TL;DR
KG-CQR enhances retrieval in knowledge graph-based systems by enriching query context through structured relation representations, leading to improved accuracy in retrieval tasks without additional training.
Contribution
It introduces a scalable, model-agnostic framework that leverages KG subgraph extraction and completion for query enrichment in retrieval-augmented generation.
Findings
Achieves 4-6% improvement in mAP over baselines
Attains 2-3% higher Recall@25
Outperforms existing methods in multi-hop question answering
Taxonomy
Topics: Advanced Graph Neural Networks · Topic Modeling · Information Retrieval and Search Behavior
KG-CQR: Leveraging Structured Relation Representations in Knowledge Graphs for Contextual Query Retrieval
Chi Minh Bui1∗, Ngoc Mai Thieu1∗, Van Vinh Nguyen2, Jason J. Jung3, Khac-Hoai Nam Bui1†
1Viettel AI, Viettel Group, Vietnam
2University of Engineering and Technology, Vietnam National University, Hanoi, Vietnam
3Department of Computer Engineering, Chung-Ang University, Korea
{minhbc4, maitn4}@viettel.com.vn, [email protected], [email protected], [email protected]
Abstract
The integration of knowledge graphs (KGs) with large language models (LLMs) offers significant potential to enhance the retrieval stage in retrieval-augmented generation (RAG) systems. In this study, we propose KG-CQR (code: https://github.com/tnmai59/KG-CQR), a novel framework for Contextual Query Retrieval (CQR) that enhances the retrieval phase by enriching complex input queries with contextual representations derived from a corpus-centric KG. Unlike existing methods that primarily address corpus-level context loss, KG-CQR focuses on query enrichment through structured relation representations, extracting and completing relevant KG subgraphs to generate semantically rich query contexts. Comprising subgraph extraction, completion, and contextual generation modules, KG-CQR operates as a model-agnostic pipeline, ensuring scalability across LLMs of varying sizes without additional training. Experimental results on the RAGBench and MultiHop-RAG datasets demonstrate that KG-CQR outperforms strong baselines, achieving improvements of up to 4–6% in mAP and approximately 2–3% in Recall@25. Furthermore, evaluations on challenging RAG tasks such as multi-hop question answering show that incorporating KG-CQR outperforms the existing baseline in terms of retrieval effectiveness.
∗ Equal contribution. † Corresponding author.
1 Introduction
Large Language Models (LLMs) have significantly advanced the field of natural language processing (NLP), particularly in understanding and generating human-like text. However, LLMs still suffer from two critical limitations: a lack of reliable factual knowledge and limited reasoning capabilities Wang et al. (2024b).
These limitations are exacerbated when LLMs are applied to domain-specific knowledge retrieval, especially in addressing queries within vertical domains Bang et al. (2023). To address these challenges, recent research has explored the integration of knowledge graphs (KGs) into LLMs as a means to provide structured, accurate knowledge sources for enhanced reasoning Pan et al. (2024). KGs, which store facts in the form of triples (i.e., head entity, relation, tail entity), offer a robust and interpretable representation of knowledge. Consequently, KGs have been increasingly incorporated into applications based on LLMs to improve performance across various tasks, such as question answering Ding et al. (2024), fact verification Pham et al. (2025a), and recommendation systems Abu-Rasheed et al. (2024).
In the context of question answering over knowledge graphs (KGQA), current approaches can be broadly categorized into two main strategies: (i) using LLMs to convert natural language queries into formal logical queries, which are then executed on KGs to derive answers Nguyen et al. (2024); Wang et al. (2024a); and (ii) retrieving relevant triples from KGs and presenting them as contextual knowledge for the LLM to generate the final answer Sarmah et al. (2024); Sun et al. (2024). Similarly, in retrieval-augmented generation (RAG) tasks, external knowledge sources, in terms of both structured (KGs) and unstructured (vectorized documents), are retrieved and incorporated into the input prompt to support answer generation by LLMs Li et al. (2024); Edge et al. (2024). Despite these advances, the retrieval process involving KGs remains underexplored in the aforementioned approaches.
This study focuses on enhancing the retrieval process for RAG systems by integrating KG technologies to enrich contextual information for complex input queries. Specifically, the objective is to tackle a critical challenge in current systems: misalignment between query and document embeddings Ma et al. (2023). Accordingly, existing methods often employ LLMs to decompose complex queries Mao et al. (2024) (Figure 1(a)). Nonetheless, in terms of retrieval performance, this approach frequently underperforms due to insufficient contextual alignment with the corpus. Subsequently, Gao et al. (2023) proposed a new approach by generating hypothetical documents to facilitate document-document similarity comparisons (Figure 1(b)). However, this method heavily relies on underlying LLMs, introducing risks of hallucination. In terms of knowledge-grounded expansion generation, Xia et al. (2025) introduced a knowledge-aware approach that leverages both unstructured data and structured relations. Nevertheless, their reliance on predefined relation schemas between entities (e.g., title) and documents constrains the scalability and adaptability.
To overcome the aforementioned limitations, we propose KG-CQR (Knowledge Graph for Contextual Query Retrieval), a novel framework that leverages KG to generate contextual information for input queries (Figure 1(c)). The key idea is to extract a relevant subgraph from the KG to enrich each query semantically. KG-CQR comprises three main modules: (i) subgraph extraction, which identifies relevant triples; (ii) subgraph completion, which infers missing triples; and (iii) contextual generation, which constructs enriched query contexts. These modules utilize a new structured representation of relations, combining textual information with KG triplets, to address the limitations of traditional entity-based scoring in KG extraction. By retrieving directly relevant data and inferring missing knowledge, KG-CQR significantly improves query contextualization. The main contributions of this work are as follows:
- We propose Contextual Query Retrieval (CQR), a novel paradigm designed to enhance the context of domain-specific queries using a predefined corpus. Our framework, KG-CQR, leverages a corpus-centric knowledge graph to improve both query understanding and retrieval effectiveness, achieving these improvements without the need for additional training.
- KG-CQR functions as a model-agnostic pipeline that employs structured relation representations to generate contextual information, ensuring adaptability and scalability across backbone LLMs with varying parameter sizes.
- We conduct extensive experiments on complex benchmark datasets specifically designed for multi-step retrieval processes in RAG systems. The results demonstrate the effectiveness of KG-CQR in enhancing retrieval quality.
2 Literature Review
2.1 Query Expansion using LLM
To handle complex queries effectively, query expansion is often essential for improving the performance of the retrieval process Azad and Deepak (2019). Traditional approaches decompose input queries into multi-view representations to enhance retrieval accuracy Zhang et al. (2022). Recently, with the rapid advancement of LLMs, a promising direction involves query enhancement, either through prompt-based techniques leveraging LLMs Wang et al. (2023), or by developing trainable frameworks that generate refined queries Mao et al. (2024). These methods aim to reformulate queries into more effective semantic representations Chan et al. (2024); Chen et al. (2024). However, they still struggle to bridge the inherent gap between queries and the knowledge corpus within the retrieval embedding space Liu et al. (2025). Accordingly, to further improve retrieval effectiveness, especially in domain-specific applications, a deeper exploitation of contextual generation remains essential Li et al. (2025).
2.2 Contextual Retrieval
Contextualized retrieval has emerged as an effective strategy for improving retrieval performance, particularly in complex and challenging settings Morris and Rush (2024). Recent methods, such as RAPTOR Sarthi et al. (2024), GraphRAG Edge et al. (2024), and HippoRAG Gutierrez et al. (2024), adopt recursive procedures that integrate embedding, clustering, and summarization techniques to construct hierarchical document representations using graph-based structures. Conceptually, these approaches follow a corpus-centric paradigm, wherein hierarchical structures are leveraged to enhance contextual retrieval across the original document corpus. In terms of query expansion through contextualization, Gao et al. (2023) proposes HyDE, a novel approach that leverages LLMs to generate hypothetical documents conditioned on the input query. Accordingly, the query is first processed by an LLM following specific instructions to produce hypothetical documents, which are then used as pseudo-contexts for retrieval based on document-to-document similarity. However, a key limitation of HyDE lies in its dependence on LLM-generated content, where potential inaccuracies or hallucinations can degrade retrieval effectiveness Zhang et al. (2024); Xia et al. (2025). Moreover, query expansion strategies must account for domain-specific context sensitivity, as the same entities may vary in meaning or relevance across different domains Bui et al. (2021). Therefore, this study proposes a novel contextual retrieval approach, which focuses on providing contextual information for the input query, based on the structured relation of the corpus-centric KG.
2.3 LLM-Powered KG Construction
One of the primary challenges in utilizing knowledge graphs (KGs) lies in their construction. Prior work relies on predefined KGs Xia et al. (2025), which limits the flexibility and scalability of the approach. In order to automatically construct a KG, given a set of unstructured data sources (corpus), knowledge graph construction (KGC) is typically framed as a structured prediction task, where models are trained to approximate target functions associated with various NLP tasks such as Named Entity Recognition (NER), Relation Extraction (RE), Entity Linking (EL), and knowledge graph completion Ye et al. (2022). However, training task-specific discriminative models often results in error propagation and limited adaptability across diverse tasks. To address these limitations, recent approaches reformulate KGC as a generative problem using sequence-to-sequence (Seq2Seq) models Lu et al. (2022). Powered by pre-trained models such as T5 Raffel et al. (2020), the Seq2Seq paradigm has demonstrated strong performance in multi-task training settings for KG construction. More recently, the emergence of LLMs has spurred interest in their application to KGC through zero-shot prompting techniques Pan et al. (2024); Zhu et al. (2024). Building on this line of work, our study leverages modern open-source LLMs, e.g., LLaMA-3.3-70B, to construct knowledge graphs by parsing and categorizing entities and their relationships directly from unstructured data.
3 Methodology
3.1 Preliminary
3.1.1 Structured Relation Representation
A corpus-centric KG consists of a set of triplets (structured relations) $\mathcal{G}$, which are defined as follows:
$$\mathcal{G} = \{(e_h, r, e_t) \mid e_h, e_t \in \mathcal{E},\ r \in \mathcal{R}\} \tag{1}$$
where $\mathcal{E}$ is the set of entities and $\mathcal{R}$ is the set of relations. Since a KG is not available for most specific domains, we follow the work in GraphRAG Edge et al. (2024) to construct the corpus-centric KG, which includes three sequential steps: i) Ingesting domain-specific unstructured data; ii) Extracting entities and their relationships using an external LLM; iii) Mapping entities through edges (relations) that contain detailed information about their relationships.
To further enhance the expressiveness of the KG, we extend each triplet with a textual triplet representation (TTR). Unlike traditional approaches that rely solely on structured relational properties, our method leverages LLMs to generate rich, natural language representations of each triplet, as defined below:
$$\tau = \mathrm{LLM}\big(P_{\mathrm{ttr}},\ (e_h, r, e_t),\ d\big) \tag{2}$$
where $\tau$ denotes the textual description of the relation, generated by an LLM based on the instruction prompt $P_{\mathrm{ttr}}$, the corresponding triplet $(e_h, r, e_t)$, and the document $d$ from which the triplet was extracted. An overview of this process is illustrated in Figure 2. In this regard, the structured relation in Equation 1 is reformulated as:
$$\mathcal{G} = \{(e_h, r, \tau, e_t) \mid e_h, e_t \in \mathcal{E},\ r \in \mathcal{R}\} \tag{3}$$
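As an illustration of the TTR idea, the sketch below stores a triplet together with its textual description and assembles the LLM input from the three ingredients named above (instruction prompt, triplet, source document). The `Triplet` class, the prompt layout, and the `llm` callable are hypothetical; the paper's actual prompt is not reproduced here.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Triplet:
    """A KG triplet extended with a textual triplet representation (TTR)."""
    head: str
    relation: str
    tail: str
    ttr: str = ""  # empty until an LLM has described the relation


def build_ttr_prompt(instruction: str, triplet: Triplet, document: str) -> str:
    # Hypothetical prompt layout: instruction, then the triplet, then its source text.
    return (
        f"{instruction}\n"
        f"Triplet: ({triplet.head}, {triplet.relation}, {triplet.tail})\n"
        f"Source document: {document}"
    )


def attach_ttr(triplet: Triplet, instruction: str, document: str,
               llm: Callable[[str], str]) -> Triplet:
    """Return a copy of the triplet whose `ttr` field holds the LLM's description."""
    prompt = build_ttr_prompt(instruction, triplet, document)
    return replace(triplet, ttr=llm(prompt))
```

Any text-generation client can be passed as `llm`; in tests a stub function that returns a fixed sentence is sufficient.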
3.1.2 Problem Definition
The objective of the retrieval process is to extract the most relevant documents for the input query, where the similarity score (i.e., cosine similarity) between a query $q$ and a document $d$ can be formulated as follows:
$$\mathrm{sim}(q, d) = \cos(\mathbf{v}_q, \mathbf{v}_d) = \frac{\langle \mathbf{v}_q, \mathbf{v}_d \rangle}{\lVert \mathbf{v}_q \rVert\, \lVert \mathbf{v}_d \rVert}, \qquad \mathbf{v}_q = \mathrm{enc}_q(q),\ \ \mathbf{v}_d = \mathrm{enc}_d(d) \tag{4}$$
The core challenge in this process lies in ensuring that the query vector $\mathbf{v}_q$ (obtained via encoder $\mathrm{enc}_q$) and the document vector $\mathbf{v}_d$ (obtained via encoder $\mathrm{enc}_d$) are embedded into a shared semantic space. Traditional retrieval models typically rely on supervised learning frameworks that train encoders using query-document pairs to learn such a shared embedding space Karpukhin et al. (2020); Santhanam et al. (2022). However, directly optimizing for query-document similarity often results in suboptimal retrieval performance, particularly when dealing with sparse or domain-specific queries. To address this limitation, we draw inspiration from the approach in Gao et al. (2023), which shifts focus toward generating contextual embeddings for the query. Notably, instead of encoding the query directly, we enrich it with contextual information derived from the corpus-centric KG. This enriched representation is then embedded in the document space, allowing the similarity computation to align with the document-document similarity paradigm. The revised retrieval formulation is as follows:
$$\mathrm{sim}(q, d) = \cos\big(\mathrm{enc}_d(\text{KG-CQR}(q)),\ \mathrm{enc}_d(d)\big) \tag{5}$$
Here, KG-CQR(q) denotes the KG-enhanced contextual information of the input query $q$.
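The difference between the two scoring paradigms can be seen in a small sketch. It compares a raw query and a context-enriched query against the same document using cosine similarity. The bag-of-words `encode` function and the example strings are illustrative assumptions, standing in for the learned encoders and the LLM-generated context used in the paper.

```python
import math
from collections import Counter


def encode(text: str) -> Counter:
    """Toy bag-of-words encoder (assumption: stands in for a dense encoder)."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    # Counter returns 0 for missing tokens, so only shared tokens contribute.
    dot = sum(u[token] * v[token] for token in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


document = "Jane Doe founded Acme in 1999"
raw_query = "who founded Acme"
# Hypothetical KG-derived context for the query.
enriched_query = "Acme was founded by Jane Doe in 1999"

raw_score = cosine(encode(raw_query), encode(document))
enriched_score = cosine(encode(enriched_query), encode(document))
```

In this toy case the enriched query shares far more vocabulary with the document than the short raw query, so `enriched_score` exceeds `raw_score`, which mirrors the document-to-document intuition behind the revised formulation.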
3.2 KG-CQR
Figure 3 gives an overview of the KG-CQR architecture, which comprises three sequential components: subgraph extraction, subgraph completion, and contextual generation.
3.2.1 Subgraph Extraction
Given an input query $q$ and a knowledge graph $\mathcal{G}$, the subgraph extraction module first identifies a set of relevant triples ($\mathcal{G}_q$), based on the input query. Traditional subgraph extraction methods typically begin by identifying entities mentioned in the query and then linking them to entities in the KG using entity linking (EL) techniques, such as LLM prompting or specialized EL tools Sun et al. (2024). However, these approaches often assume that the KG is complete, i.e., all factual triples relevant to the query are present in the graph, which is rarely the case in real-world scenarios Xu et al. (2024). Furthermore, current subgraph extraction techniques predominantly rely on assessing semantic similarity at the entity or keyword level Sun et al. (2024); Luo et al. (2024). Nevertheless, this limited granularity often fails to capture sufficient textual context, thereby reducing extraction performance, particularly when input queries involve ambiguous entities Pham et al. (2025b); Xia et al. (2025). To address these limitations, we leverage textual representations of triples (as defined in Equation 2) to measure similarity with the input query. This approach enables subgraph extraction at the sentence level, rather than relying solely on the entity level. The subgraph extraction is formalized as follows:
$$\hat{\mathcal{G}}_q = \underset{(e_h, r, \tau, e_t) \in \mathcal{G}}{\operatorname{top-}k}\ \cos\big(\mathbf{v}_q,\ \mathrm{enc}(\tau)\big) \tag{6}$$
where $\mathbf{v}_q$ is the embedding of the input query, $\mathrm{enc}(\tau)$ is the embedding of a triplet's TTR $\tau$, and $k$ is a hyperparameter controlling the number of top-matching triples retrieved.
Subsequently, inspired by previous work on the subgraph extraction process Sun et al. (2024), a filtering step is performed using an LLM with a task-specific prompt to remove irrelevant triples:
$$\mathcal{G}_q = \mathrm{LLM}\big(P_{\mathrm{filter}},\ q,\ \hat{\mathcal{G}}_q\big) \tag{7}$$
Here, $P_{\mathrm{filter}}$ denotes the instruction prompt used by the LLM for the final selection. The details of $P_{\mathrm{filter}}$ are provided in Appendix 6.5.
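The two stages of subgraph extraction (top-k retrieval over TTRs, then filtering) can be sketched as follows. This is a minimal illustration under stated assumptions: the bag-of-words `encode` function replaces a real sentence encoder, and the `keep` callback is a hypothetical stand-in for the LLM filtering prompt.

```python
import heapq
import math
from collections import Counter


def encode(text):
    """Toy bag-of-words encoder (assumption: replaces a real sentence embedder)."""
    return Counter(text.lower().split())


def cosine(u, v):
    dot = sum(u[token] * v[token] for token in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def extract_subgraph(query, triples, k, keep=lambda query, triple: True):
    """Select the k triples whose TTR is most similar to the query, then filter.

    triples: iterable of (head, relation, tail, ttr) tuples.
    keep:    predicate standing in for the LLM-based relevance filter.
    """
    q_vec = encode(query)
    # Sentence-level matching: similarity is computed on the TTR text, not on entity names.
    top = heapq.nlargest(k, triples, key=lambda tr: cosine(q_vec, encode(tr[3])))
    return [tr for tr in top if keep(query, tr)]


kg = [
    ("Acme", "founded_by", "Jane Doe", "Acme was founded by Jane Doe"),
    ("Acme", "located_in", "Berlin", "Acme is located in Berlin"),
    ("Jane Doe", "born_in", "Oslo", "Jane Doe was born in Oslo"),
]
candidates = extract_subgraph("who founded Acme", kg, k=2)
```

With a production encoder, the TTR embeddings would normally be precomputed and stored in a vector index so that only the query is embedded at retrieval time.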
3.2.2 Subgraph Completion
The initial subgraph $\mathcal{G}_q$ is extracted based on semantic similarity, typically resulting in a limited set of triplets that may lack sufficient contextual information. The goal of the subgraph completion function is to enrich this subgraph by incorporating additional triplets from the structured relations of the KG ($\mathcal{G}$) that form semantically meaningful paths between entities in $\mathcal{G}_q$. Relevance is assessed by aggregating the semantic similarities between the input query and triplet textual representations along these paths.
The subgraph completion proceeds through the following steps (Algorithm 1):
- Step 1: Extract entities from the initial subgraph $\mathcal{G}_q$.
- Step 2: Apply Beam Search, a heuristic-guided variant of Breadth-First Search (BFS), to identify the top-n candidate paths.
- Step 3: Filter out paths that contain nodes not present in the initial subgraph $\mathcal{G}_q$.
- Step 4: Select the top-K highest-scoring unique triplets, with K defaulting to 20.
- Step 5: Construct the completed subgraph $\mathcal{G}_q^{*}$ by merging the initial subgraph with the selected triplets.
Notably, to reduce computational complexity in Step 2, instead of executing a naive BFS traversal, only a limited number of nodes are expanded, guided by a heuristic function (BFSBeam). This function computes the semantic similarity between the input query and each TTR and aggregates these relevance scores along each path, as described in more detail in Appendix 6.3.
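The five steps above can be sketched as a small beam search. This is not the paper's Algorithm 1 but an illustration under several assumptions: edges are followed from head to tail only, a path's score is the sum of per-triplet scores, Step 3 is read literally (every node on a kept path must already be an entity of the initial subgraph), and `score` is a hypothetical callable standing in for the query-to-TTR similarity.

```python
import heapq


def complete_subgraph(initial, kg, score, beam_width=3, max_depth=2, top_k=20):
    """initial, kg: lists of (head, relation, tail); score: triplet -> float."""
    initial_set = set(initial)
    entities = {e for h, _, t in initial for e in (h, t)}            # Step 1

    adjacency = {}
    for triple in kg:
        adjacency.setdefault(triple[0], []).append(triple)

    # Step 2: each beam entry is (path score, path as tuple of triplets, frontier node).
    beams = [(0.0, (), entity) for entity in sorted(entities)]
    explored = []
    for _ in range(max_depth):
        candidates = []
        for path_score, path, node in beams:
            for triple in adjacency.get(node, []):
                if triple in path:
                    continue  # never reuse an edge within one path
                candidates.append((path_score + score(triple), path + (triple,), triple[2]))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        explored.extend(beams)

    # Step 3: drop paths that touch a node outside the initial entity set.
    kept = [(s, p) for s, p, _ in explored
            if all(h in entities and t in entities for h, _, t in p)]

    # Step 4: rank each new triplet by the best path it appears in, keep top_k.
    best = {}
    for path_score, path in kept:
        for triple in path:
            if triple not in initial_set:
                best[triple] = max(best.get(triple, float("-inf")), path_score)
    selected = sorted(best, key=lambda t: (-best[t], t))[:top_k]

    return list(initial) + selected                                   # Step 5
```

Because only `beam_width` paths survive each depth, the number of expanded nodes grows linearly with depth instead of exponentially as in an exhaustive BFS. Summing scores favors longer paths; averaging would be an equally plausible aggregation.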
3.2.3 Contextual Generation
The contextual generation module prompts an LLM with the input query $q$ and the completed subgraph $\mathcal{G}_q^{*}$ to produce the contextual information KG-CQR(q), which is formally defined as:
$$\text{KG-CQR}(q) = \mathrm{LLM}\big(P_{\mathrm{gen}},\ q,\ \mathcal{G}_q^{*}\big) \tag{8}$$
where $P_{\mathrm{gen}}$ represents the generation instruction prompt, as detailed in Appendix 6.5. The enriched subgraph serves as contextual input to the LLM, facilitating the generation of a contextually enriched query representation. This reformulated query can then be encoded within the same embedding space as the corpus documents, enabling effective retrieval.
3.3 Retrieval Fusion Function
The input query and its synthetic contextual information are embedded using a fusion encoder-based approach. This technique enables the retrieval system to go beyond superficial query-document matching by leveraging the interaction between the query and its enriched context, resulting in more accurate and semantically relevant retrieval outcomes Bruch et al. (2024). In this work, we adopt a weighted-sum fusion mechanism to compute the final query representation, defined as:
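$$\mathbf{v}_{\hat{q}} = \alpha \cdot \mathrm{enc}_q(q) + (1 - \alpha) \cdot \mathrm{enc}_d\big(\text{KG-CQR}(q)\big) \tag{9}$$
where $\mathrm{enc}_q$ and $\mathrm{enc}_d$ denote the query and document encoders, and $\alpha \in [0, 1]$ is the fusion weight that balances the original query against its generated context. This fusion mechanism proves especially effective in complex, multi-turn, or context-sensitive retrieval scenarios, where conventional query enhancement or decomposition methods often fall short. Consequently, the objective function in Equation 5 can be reformulated as:
$$\mathrm{sim}(q, d) = \cos\big(\mathbf{v}_{\hat{q}},\ \mathrm{enc}_d(d)\big) \tag{10}$$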
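A weighted-sum fusion of this kind can be sketched in a few lines. The sketch assumes the common convex-combination form, in which a weight `alpha` multiplies the query vector and `1 - alpha` multiplies the context vector; which side the weight applies to, and the vectors themselves, are assumptions made for illustration.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fuse(query_vec, context_vec, alpha):
    """Convex combination of the query embedding and its context embedding."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * q + (1.0 - alpha) * c for q, c in zip(query_vec, context_vec)]


# Hypothetical 2-d embeddings: the context points toward the document, the raw query does not.
query_vec, context_vec, doc_vec = [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]
fused = fuse(query_vec, context_vec, alpha=0.5)
fused_score = cosine(fused, doc_vec)
```

Setting `alpha` to 1 recovers plain query-document matching, while setting it to 0 scores documents against the generated context alone; intermediate values trade the two signals off.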
4 Experiment
4.1 Experimental Setup
Baseline: We evaluate our method using three baseline models that encompass diverse document retrieval strategies: (i) BM25 Robertson and Zaragoza (2009), a classical sparse retrieval model; (ii) DPR Karpukhin et al. (2020), a dense retrieval approach based on a dual-encoder architecture that independently encodes queries and passages, optimizing their embeddings via contrastive loss; and (iii) BGE Xiao et al. (2024), which combines dense, sparse, and multi-vector retrieval using a self-knowledge distillation framework. To comprehensively examine the impact of KG-CQR on retrieval performance, we further compare KG-CQR with two representative approaches in this research field: Query Expansion Chen et al. (2024) and HyDE Gao et al. (2023).
Benchmark Datasets: We evaluate our method on two recent and widely used benchmark datasets: (i) RAGBench Friel et al. (2024), which spans five distinct industry-specific domains and whose test set of approximately 11,000 instances we use for retrieval evaluation; and (ii) MultiHop-RAG Tang and Yang (2024), which includes a knowledge base, a large set of multi-hop queries, corresponding ground-truth answers, and supporting evidence, totaling 2,556 queries for evaluation. For each dataset, the corresponding KG is constructed in three steps, as outlined in Section 3.1.1, using the LLaMA-3.3-70B model.
4.2 Main Results
Table 1 presents the evaluation results of the retrieval process on both datasets. Retrieval accuracy is evaluated using standard metrics: mean Average Precision (mAP) and Recall@k at several cutoffs k (including k = 25). The reported results use the value of the fusion weight (Equation 9) that was found to yield the best performance (the selection of this value is further discussed in Appendix 6.2.2). From the results, we draw the following observations:
i) Retrieval Performance: KG-CQR significantly improves retrieval performance across various retrieval backbones. On the RAGBench dataset, KG-CQR + BGE achieves the best performance overall, with an mAP of 0.542 and Recall@25 of 0.675, outperforming both the baseline models and the HyDE-enhanced variants. On the more challenging MultiHop-RAG dataset, KG-CQR + BM25 achieves the highest recall metrics (e.g., Recall@25 = 0.532), demonstrating KG-CQR’s effectiveness compared to traditional methods.
ii) Impact of Query Expansion (QE): Compared with their respective baselines, QE-augmented models generally underperform across both datasets. For instance, QE + BM25 (mAP = 0.280 on RAGBench, 0.124 on MultiHop-RAG) performs notably worse than plain BM25, and similar degradations are observed for DPR and BGE backbones. This suggests that naive query expansion often introduces noise and semantic drift, which outweighs the potential benefits of richer lexical coverage. In contrast, KG-CQR achieves consistent improvements by leveraging structured knowledge for contextually grounded reformulations instead of unguided expansions.
iii) Contextual Accuracy: The comparatively lower performance of HyDE relative to its baselines indicates potential limitations in relying extensively on synthetic documents generated by LLMs. Specifically, while HyDE offers a straightforward method for enhancing contextual understanding, its effectiveness is notably sensitive to the contextual reliability of the generated content. This highlights the constraints of inadequately grounded synthetic information in retrieval tasks.
iv) Diverse Benchmarks: Although models like BGE perform well on relatively straightforward datasets such as RAGBench, more complex datasets like MultiHop-RAG demand advanced reasoning capabilities. KG-CQR demonstrates robustness in such settings by effectively handling multi-hop reasoning and maintaining strong performance. These results highlight the importance of retrieval frameworks that integrate contextual understanding and structured knowledge to perform consistently across diverse and complex benchmarks.
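For reference, the two retrieval metrics used in Table 1 can be computed as in the sketch below. It follows one common convention, in which Average Precision is normalized by the total number of relevant documents; the paper does not state which variant it uses, and the document identifiers are made up for the example.

```python
def average_precision(ranked, relevant):
    """AP for one query: mean of precision values at the ranks of relevant hits."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked list, set of relevant ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs) if runs else 0.0


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
ap = average_precision(ranked, relevant)      # hits at ranks 2 and 4
r2 = recall_at_k(ranked, relevant, k=2)
```

mAP rewards placing relevant documents early in the ranking, whereas Recall@k only checks whether they appear within the cutoff, which is why the two metrics can rank systems differently.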
4.3 Detailed Analysis
4.3.1 Impact of LLM Backbone
Table 2 illustrates the retrieval performance of KG-CQR when paired with different sizes of language models, using BGE as the underlying retrieval method.
Specifically, the LLaMA-3.3-70B model achieves the highest performance across nearly all metrics; however, the performance differences between the 8B and 70B variants are relatively modest, suggesting diminishing returns as model size increases. These findings indicate that while larger models do offer performance advantages, KG-CQR remains effective even with relatively smaller backbones such as LLaMA-3.2-3B and LLaMA-3.1-8B. This highlights KG-CQR’s practicality for resource-constrained environments, offering a favorable trade-off between retrieval performance and computational cost.
4.3.2 Ablation Study
Table 3 presents an ablation study evaluating the contribution of two core components of KG-CQR: the Textual Triplet Representation (TTR) used for subgraph extraction and the Subgraph Completion module (Sub.Comp.).
As shown in the results, removing TTR (Equation 2) leads to the most pronounced drop in performance (e.g., Recall@25 decreases from 0.675 to 0.641), underscoring the importance of TTR in accurately extracting relevant subgraphs that preserve semantic alignment with the query. This confirms that converting structured KG information into textual form plays a critical role in aligning the knowledge with the retrieval task. Similarly, omitting the Subgraph Completion module also results in a notable performance degradation, though less severe than removing TTR. This suggests that while the initial subgraph extraction is vital, enriching the subgraph context via completion further improves the model’s ability to retrieve relevant documents.
4.3.3 Multi-Step Retrieval for RAG Task
We evaluate the effectiveness of KG-CQR in multi-step reasoning RAG tasks by integrating its retrieval outputs into the IRCoT framework Trivedi et al. (2023).
To assess the generalizability of KG-CQR, experiments were conducted with three LLMs of varying sizes across multiple datasets. The evaluation highlights the role of KG-CQR in enhancing retrieval performance for reasoning-intensive RAG tasks. We randomly sampled 500 examples from the RAGBench test set and evaluated results using F1, GPT-Score Fu et al. (2024), and the average number of reasoning steps. In addition, two widely used multi-hop QA benchmarks, HotpotQA Yang et al. (2018) and MuSiQue Trivedi et al. (2022), were included in the evaluation. GPT-Score was computed using GPT-4o through the OpenAI API; GPT-4o was selected as the judge based on its performance on the Judge LLM leaderboard (https://huggingface.co/spaces/AtlaAI/judge-arena). As shown in Table 4, several key insights can be drawn:
i) KG-CQR substantially improves retrieval quality across datasets: On RAGBench, KG-CQR + BM25 consistently outperforms BM25, with performance gains across all LLM sizes (e.g., F1 improves from 0.393 to 0.410 on LLaMA-3.1-8B). Similar improvements are observed on HotpotQA, where KG-CQR yields a significant gain for the largest model (F1 = 0.700 vs. 0.663). The effect is most pronounced on MuSiQue, where KG-CQR + BM25 achieves F1 = 0.489 with LLaMA-3.3-70B compared to 0.374 for BM25, underscoring its effectiveness for complex multi-hop reasoning (results with the BGE retriever are provided in Appendix 6.2.1).
ii) Contextualized reformulations reduce reasoning iterations: KG-CQR consistently decreases the average number of reasoning steps. For example, on HotpotQA with LLaMA-3.3-70B, the number of steps is reduced from 1.465 to 1.280. This suggests that knowledge-grounded query reformulations provide more accurate intermediate evidence, enabling models to converge on answers with fewer redundant reasoning cycles.
iii) Cross-model scalability and robustness: Performance gains are observed across different LLM sizes, highlighting the adaptability of KG-CQR. Notably, the improvements are more pronounced on datasets requiring deeper reasoning (e.g., RAGBench and MuSiQue), indicating that KG-CQR effectively complements LLM reasoning by supplying better-targeted retrieval contexts.
4.3.4 Retrieval Latency
Figure 4 compares the relative retrieval latency of the baseline HyDE with three KG-CQR variants:
i) KG-CQR w/ Naive-BFS: uses the basic BFS algorithm for subgraph completion;
ii) KG-CQR w/o Sub.Comp.: removes the subgraph completion module entirely;
iii) KG-CQR (ours): uses heuristic-guided Beam Search for more efficient subgraph completion.
The analysis indicates that the proposed KG-CQR with Beam Search strikes a favorable balance between retrieval efficiency and reasoning capability. While KG-CQR without subgraph completion is the fastest, KG-CQR with Beam Search provides a more scalable and semantically expressive alternative with only modest additional cost. In contrast, HyDE and the naive BFS variant incur higher latency, making them less favorable for real-time or large-scale applications.
4.3.5 Complementarity with Other Methods
While KG-based methods such as GraphRAG Edge et al. (2024) and HippoRAG Gutierrez et al. (2024) emphasize corpus-centric expansion, KG-CQR focuses on query-centric reformulation. To assess their complementarity, we integrated KG-CQR with HippoRAG2 Gutiérrez et al. (2025), as reported in Table 5.
The integration yields consistent improvements (e.g., mAP +0.027, Recall@25 +0.029), showing that KG-CQR complements corpus-centric approaches by aligning queries more effectively with relevant evidence. The observed improvements suggest that combining query-centric and corpus-centric KG-based techniques yields a more comprehensive retrieval framework, capable of strengthening both contextual grounding and coverage in multi-hop QA tasks.
5 Conclusion
This study presented KG-CQR, a novel retrieval framework that leverages knowledge graphs to enhance contextual query retrieval in RAG systems. By combining subgraph extraction and completion with structured relation representations, KG-CQR enriches query semantics and improves alignment with document embeddings. Experiments on RAGBench and MultiHop-RAG show consistent gains in retrieval performance, while analyses highlight the critical role of textual triplet representation and subgraph completion. Further evaluations on multi-step reasoning RAG tasks indicate improved accuracy while reducing redundant reasoning steps.
Limitations
Although KG-CQR demonstrates promising results, several limitations warrant consideration for future improvements:
KG Construction Challenges: The construction of the corpus-centric knowledge graph relies heavily on external LLMs, such as LLaMA-3.3-70B, for entity and relation extraction. This process is susceptible to errors in named entity recognition (NER), relation extraction (RE), and entity linking (EL), which can propagate through the pipeline and affect the quality of the extracted subgraph. In domains with sparse or noisy unstructured data, the resulting KG may lack completeness or accuracy, limiting the effectiveness of KG-CQR.
Scalability of Subgraph Extraction: The subgraph extraction process, while effective, can be computationally intensive for large-scale knowledge graphs with millions of triples. Sentence-level semantic similarity computation with textual triplet representations (TTRs) increases computational overhead, potentially limiting scalability in real-time retrieval systems or resource-constrained environments.
Limited Evaluation Scope: The current evaluation of KG-CQR is restricted to several benchmark datasets. While these datasets are diverse, they may not fully reflect the range and complexity of real-world retrieval scenarios. To more rigorously assess the generalizability of the proposed framework, future work should include evaluations on additional datasets, particularly those that involve cross-lingual settings or highly domain-specific knowledge.
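To make the scalability concern concrete, the sketch below shows a brute-force top-k selection of triplets by cosine similarity between a query embedding and TTR embeddings. It is an illustrative assumption, not the extraction procedure used in KG-CQR: the function names and the dictionary of precomputed embeddings are hypothetical. The point is only that an exhaustive scan costs time proportional to the number of triples times the embedding dimension for every query.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def top_k_triplets(
    query_emb: Sequence[float],
    ttr_embs: Dict[str, Sequence[float]],  # hypothetical: TTR text -> precomputed embedding
    k: int,
) -> List[Tuple[float, str]]:
    """Score every TTR against the query and keep the k most similar ones."""
    scored = ((cosine(query_emb, emb), ttr) for ttr, emb in ttr_embs.items())
    return heapq.nlargest(k, scored)
```

In practice, an approximate nearest-neighbor index would typically replace the exhaustive scan once the graph grows to millions of triples.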
6 Appendix
6.1 GPT-score Criteria
Following Fu et al. (2024), we define the GPT-Score using three criteria:
- Correctness: alignment of the generated answer with the reference answer
- Faithfulness: whether the generated answer remains true to the given context
- Relevance: how well the retrieved context and the generated answer address the query
6.2 Comprehensive Experimental Results
6.2.1 Multi-Step Retrieval for RAG with BGE
Building on the earlier analysis (Table 4), Table 6 presents results for multi-step reasoning RAG performance using BGE as the retrieval baseline, along with KG-CQR. The key observations are as follows:
i) Dense retrieval outperforms sparse retrieval across all model sizes: BGE consistently outperforms BM25 in terms of F1 score and GPT-Score, which demonstrates that dense retrieval via BGE retrieves more semantically relevant contexts than BM25, supporting more accurate and efficient reasoning; ii) KG-CQR improves both BM25 and BGE retrieval: Adding KG-CQR on top of both BM25 and BGE enhances performance by enriching the query with context-relevant knowledge. Although the improvement margin is narrower in the BGE setting, KG-CQR still consistently enhances performance, highlighting its generality across retrieval methods.
6.2.2 Fusion Embeddings Experiments
Table 7 reports a comprehensive evaluation of the fusion weight used to combine the input query and context embeddings (Equation 9).
Setting the fusion weight to 0.7 consistently yields the best overall performance.
Table 8 and Table 9 then report the full experimental results for the two other backbones, LLaMA-3.2-3B and LLaMA-3.1-8B, respectively. As with LLaMA-3.3-70B, the KG-CQR + BGE configuration with the fusion weight set to 0.7 yields the best performance for both models, and LLaMA-3.1-8B shows slight improvements over LLaMA-3.2-3B, particularly on MultiHop-RAG tasks.
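As a rough illustration of what such a fusion weight does, the sketch below combines a query embedding and a context embedding as a convex combination and renormalizes the result. The exact form of Equation 9 is not reproduced here, so the combination rule, the choice to apply the weight to the query side, and the function names are all assumptions made for this example.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_embeddings(
    query_emb: Sequence[float],
    context_emb: Sequence[float],
    weight: float = 0.7,
) -> List[float]:
    """Assumed fusion rule: weight * query + (1 - weight) * context, then L2-normalize."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(query_emb) != len(context_emb):
        raise ValueError("embeddings must have the same dimension")
    q = l2_normalize(query_emb)
    c = l2_normalize(context_emb)
    fused = [weight * qi + (1.0 - weight) * ci for qi, ci in zip(q, c)]
    return l2_normalize(fused)
```

Under this assumed rule, a weight of 1.0 ignores the generated context entirely and a weight of 0.0 ignores the original query, so intermediate values trade off fidelity to the query against the added KG-derived context.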
6.3 BFS with Beam Search Algorithm
Algorithm 2 presents the pseudocode for BFS with beam search. Given a beam width hyperparameter (e.g., 3), the algorithm explores explicit paths (sequences of triplets) that represent meaningful connections between entities within the given subgraph.
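The following is a minimal sketch of a breadth-first traversal pruned by a beam, written under stated assumptions rather than as a transcription of Algorithm 2: triplets are treated as directed (head, relation, tail) edges, `score_fn` is a hypothetical per-triplet relevance score, path scores are summed, and `max_depth` is an illustrative parameter.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Triplet = Tuple[str, str, str]  # (head, relation, tail)
Path = Tuple[Triplet, ...]


def beam_bfs(
    triplets: Sequence[Triplet],
    start_entities: Sequence[str],
    score_fn: Callable[[Triplet], float],  # hypothetical relevance scorer
    beam_width: int = 3,
    max_depth: int = 2,
) -> List[Path]:
    """Expand paths level by level, keeping only the beam_width best paths per level."""
    adjacency: Dict[str, List[Triplet]] = defaultdict(list)
    for triplet in triplets:
        adjacency[triplet[0]].append(triplet)

    # Each beam entry is (cumulative score, path so far, entity at the end of the path).
    beam: List[Tuple[float, Path, str]] = [(0.0, (), e) for e in start_entities]
    results: List[Path] = []
    for _ in range(max_depth):
        candidates: List[Tuple[float, Path, str]] = []
        for score, path, entity in beam:
            visited = {entity} | {t[0] for t in path}
            for triplet in adjacency.get(entity, []):
                if triplet[2] in visited:
                    continue  # skip edges that would revisit an entity on this path
                candidates.append((score + score_fn(triplet), path + (triplet,), triplet[2]))
        if not candidates:
            break
        # Stable sort: ties keep their expansion order.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
        results.extend(path for _, path, _ in beam)
    return results
```

A small beam width keeps the number of explored paths bounded at each level, at the cost of possibly discarding a path whose early triplets score poorly but whose later triplets are relevant.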
6.4 Error Analysis with Examples
To better understand the behavior of KG-CQR, we performed a qualitative error analysis on six representative multi-hop queries from the MultiHop-RAG dataset: three with correct retrievals (Table 10) and three with incorrect retrievals (Table 11). We compared the outputs of KG-CQR against those of HyDE and the human-annotated ground truth.
The results in Table 10 support the following observations: i) KG-CQR demonstrates strong performance in disambiguating entities. For instance, in the query “Did one of CBS’s performers create a scandal?”, KG-CQR retrieves documents specifically related to the mentioned performer and event. This shows that incorporating knowledge graph information improves precision by retrieving documents more closely aligned with the query context; ii) In time-sensitive queries like “Which events occurred in Week 12?”, KG-CQR accurately retrieves temporally relevant content, whereas HyDE often returns general or loosely connected documents. This suggests that KG signals enhance temporal grounding in multi-hop retrieval tasks; iii) For bridge-type queries that require chaining multiple pieces of information (e.g., “Does the article from Wendy refer to the same city?”), KG-CQR performs well by retrieving documents that correctly capture the intermediate and final entities. This indicates improved multi-hop coherence over baseline methods.
Despite these strengths, the proposed KG-CQR shows notable limitations in the following areas (Table 11): i) Contextual Drift and Irrelevant Retrievals: KG-CQR struggles with queries requiring fine-grained temporal reasoning, comparative analysis, or interpretation of subjective content. These limitations stem from insufficient temporal representation and the lack of deep semantic modeling needed to capture nuanced relationships and contrasting viewpoints; ii) Limited Multi-hop Coherence: For queries requiring reasoning across multiple documents, KG-CQR sometimes retrieved disconnected evidence, failing to form a complete answer path.
6.5 Prompt Template
For better reproducibility, we present all prompt templates in the appendix. Below is a quick reference list outlining the prompt templates and their usages:
- Figure 5: Task-instruction prompt for KG construction.
- Figure 6: Task-instruction prompt for textual triplet representation.
- Figure 7: Task-instruction prompt for filtering irrelevant triplets.
- Figure 8: Task-instruction prompt for contextual generation.
- 1. Abu-Rasheed et al. (2024). Hasan Abu-Rasheed, Christian Weber, and Madjid Fathi. 2024. Knowledge graphs as context sources for llm-based explanations of learning recommendations. In IEEE Global Engineering Education Conference, EDUCON 2024, Kos Island, Greece, May 8-11, 2024, pages 1–5. IEEE.
- 2. Azad and Deepak (2019). Hiteshwar Kumar Azad and Akshay Deepak. 2019. Query expansion techniques for information retrieval: A survey. Inf. Process. Manag., 56(5):1698–1735.
- 3. Bang et al. (2023). Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguisti
- 4. Bruch et al. (2024). Sebastian Bruch, Siyu Gai, and Amir Ingber. 2024. An analysis of fusion functions for hybrid retrieval. ACM Trans. Inf. Syst., 42(1):20:1–20:35.
- 5. Bui et al. (2021). Manh-Ha Bui, Toan Tran, Anh Tran, and Dinh Q. Phung. 2021. Exploiting domain-specific features to enhance domain generalization. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 21189–21201.
- 6. Chan et al. (2024). Chi-Min Chan, Chunpu Xu, Ruibin Yuan, Hongyin Luo, Wei Xue, Yike Guo, and Jie Fu. 2024. RQ-RAG: learning to refine queries for retrieval augmented generation. CoRR, abs/2404.00610.
- 7. Chen et al. (2024). Xinran Chen, Xuanang Chen, Ben He, Tengfei Wen, and Le Sun. 2024. Analyze, generate and refine: Query expansion with llms for zero-shot open-domain QA. In Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 11908–11922. Association for Computational Linguistics.
- 8. Ding et al. (2024). Wentao Ding, Jinmao Li, Liangchuan Luo, and Yuzhong Qu. 2024. Enhancing complex question answering over knowledge graphs through evidence pattern retrieval. In Proceedings of the ACM on Web Conference 2024, WWW 2024, Singapore, May 13-17, 2024, pages 2106–2115. ACM.
