Citation-Grounded Code Comprehension: Preventing LLM Hallucination Through Hybrid Retrieval and Graph-Augmented Context
Jahidul Arafat

TL;DR
This paper proposes a hybrid retrieval approach combining textual, semantic, and structural analysis to improve citation accuracy and prevent hallucinations in LLM-based code comprehension tools, validated on Python repositories.
Contribution
The paper introduces a hybrid retrieval system with graph expansion that improves citation accuracy and cross-file evidence discovery in code comprehension, targeting hallucinated citations to source code.
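As a rough illustration of what hybrid retrieval with graph expansion can look like, the sketch below fuses a lexical score with a similarity score and then adds the import neighbors of the top hits. It is not the paper's implementation: the toy corpus, the import graph, the weight `alpha`, and the character-trigram cosine used as a stand-in for a real embedding model are all assumptions made for this example.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus: file name -> source text (illustration only).
CHUNKS = {
    "auth.py": "def login(user, password): return check_password(user, password)",
    "crypto.py": "def check_password(user, password): return hash_pw(password) == user.pw_hash",
    "views.py": "def render_home(request): return template('home.html')",
}

# Hypothetical import graph: file -> files it depends on.
IMPORTS = {"auth.py": ["crypto.py"], "crypto.py": [], "views.py": []}


def tokens(text):
    """Lowercase word tokens (underscores stay inside identifiers)."""
    return [t for t in re.split(r"\W+", text.lower()) if t]


def lexical_score(query, text):
    """Sparse signal: fraction of query tokens that occur in the chunk."""
    q, d = set(tokens(query)), set(tokens(text))
    return len(q & d) / len(q) if q else 0.0


def trigram_vector(text):
    """Character-trigram counts; an assumed stand-in for a dense embedding."""
    s = text.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def dense_score(query, text):
    """Cosine similarity between the two trigram vectors."""
    a, b = trigram_vector(query), trigram_vector(text)
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def hybrid_retrieve(query, k=1, alpha=0.5):
    """Rank chunks by a weighted sum of both scores, then expand one hop."""
    scored = {
        cid: alpha * lexical_score(query, text)
        + (1 - alpha) * dense_score(query, text)
        for cid, text in CHUNKS.items()
    }
    top = sorted(scored, key=scored.get, reverse=True)[:k]
    expanded = list(top)
    for cid in top:
        # Graph expansion: pull in files the hit depends on.
        for dep in IMPORTS.get(cid, []):
            if dep not in expanded:
                expanded.append(dep)
    return expanded


print(hybrid_retrieve("how does login work", k=1))
```

In this toy setup the query matches only `auth.py` directly, and the one-hop expansion is what brings `crypto.py` into the context, which is the kind of cross-file evidence the summary refers to.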
Findings
Achieved 92% citation accuracy with zero hallucinations.
Hybrid retrieval outperforms single-modality baselines by 14-18%.
Discovered cross-file evidence in 62% of architectural queries.
Abstract
Large language models have become essential tools for code comprehension, enabling developers to query unfamiliar codebases through natural language interfaces. However, LLM hallucination, here meaning the generation of plausible but factually incorrect citations to source code, remains a critical barrier to reliable developer assistance. This paper addresses the challenges of achieving verifiable, citation-grounded code comprehension through hybrid retrieval and lightweight structural reasoning. Our work is grounded in systematic evaluation across 30 Python repositories with 180 developer queries, comparing retrieval modalities, graph expansion strategies, and citation verification mechanisms. We find that challenges of citation accuracy arise from the interplay between sparse lexical matching, dense semantic similarity, and cross-file architectural dependencies. Among these, cross-file evidence discovery…
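The abstract mentions citation verification mechanisms without describing them in the visible text. One simple way such a check could work is sketched below: parse citations out of a generated answer and accept only those that point at a file and line range actually present in the retrieved context. The `[path:start-end]` citation syntax, the function name `verify_citations`, and the sample data are hypothetical and chosen for this example only.

```python
import re

# Assumed citation syntax: [path:line] or [path:start-end].
CITATION = re.compile(r"\[([\w./-]+):(\d+)(?:-(\d+))?\]")


def verify_citations(answer, retrieved):
    """Split citations in `answer` into grounded and ungrounded ones.

    `retrieved` maps a file path to the list of source lines that were
    placed in the model's context.
    """
    valid, invalid = [], []
    for match in CITATION.finditer(answer):
        path = match.group(1)
        start = int(match.group(2))
        end = int(match.group(3)) if match.group(3) else start
        lines = retrieved.get(path)
        # Grounded only if the file was retrieved and the range exists in it.
        ok = lines is not None and 1 <= start <= end <= len(lines)
        (valid if ok else invalid).append((path, start, end))
    return valid, invalid


# Hypothetical retrieved context and model answer.
retrieved = {
    "auth.py": [
        "def login(user, password):",
        "    return check_password(user, password)",
    ]
}
answer = (
    "login delegates to check_password [auth.py:1-2], "
    "which is defined in [crypto.py:10]."
)
valid, invalid = verify_citations(answer, retrieved)
print(valid, invalid)
```

A check like this only confirms that a cited span exists in the supplied context; it does not confirm that the span supports the claim attached to it, which would need a separate step.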
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Scientific Computing and Data Management
