Calibrated Fusion for Heterogeneous Graph-Vector Retrieval in Multi-Hop QA
Andre Bacellar

TL;DR
This paper introduces PhaseGraph, a score calibration method that places heterogeneous retrieval scores (dense vector similarity and graph-based relevance) on a common scale for multi-hop question answering, improving last-hop retrieval accuracy on MuSiQue and 2WikiMultiHopQA.
Contribution
It proposes a novel percentile-rank normalization technique for stable fusion of vector and graph scores in multi-hop QA retrieval.
Findings
Calibrated fusion improves held-out LastHop@5 from 75.1% to 76.5% on MuSiQue and from 51.7% to 53.6% on 2WikiMultiHopQA.
Percentile-rank normalization is more robust than min-max normalization.
Boltzmann weighting performs comparably to linear fusion after calibration.
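The robustness finding can be illustrated with a toy comparison. The sketch below is not the paper's code: the function names `min_max` and `percentile_rank` and the sample scores are hypothetical, and the percentile rank is taken to be the empirical CDF over the candidate set, which is an assumption about the exact estimator. It shows how a single extreme score compresses min-max outputs toward zero while percentile ranks stay evenly spread.

```python
import numpy as np

def min_max(scores):
    """Rescale scores linearly to [0, 1]; sensitive to extreme values."""
    s = np.asarray(scores, dtype=float)
    span = s.max() - s.min()
    if span == 0:
        return np.zeros_like(s)  # all scores equal: no ordering information
    return (s - s.min()) / span

def percentile_rank(scores):
    """Map each score to the fraction of scores less than or equal to it."""
    s = np.asarray(scores, dtype=float)
    return np.searchsorted(np.sort(s), s, side="right") / s.size

# Hypothetical scores with one outlier in the last position.
with_outlier = [0.1, 0.2, 0.3, 0.4, 100.0]

mm = min_max(with_outlier)          # first four values are squeezed near 0
pr = percentile_rank(with_outlier)  # values are spread evenly over (0, 1]
```

Under min-max the four ordinary candidates become nearly indistinguishable once the outlier sets the range, whereas the percentile ranks depend only on ordering and are unaffected by how extreme the outlier is.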
Abstract
Graph-augmented retrieval combines dense similarity with graph-based relevance signals such as Personalized PageRank (PPR), but these scores have different distributions and are not directly comparable. We study this as a score calibration problem for heterogeneous retrieval fusion in multi-hop question answering. Our method, PhaseGraph, maps vector and graph scores to a common unit-free scale using percentile-rank normalization (PIT) before fusion, enabling stable combination without discarding magnitude information. Across MuSiQue and 2WikiMultiHopQA, calibrated fusion improves held-out last-hop retrieval on HippoRAG2-style benchmarks: LastHop@5 increases from 75.1% to 76.5% on MuSiQue (8W/1L, p=0.039) and from 51.7% to 53.6% on 2WikiMultiHopQA (11W/2L, p=0.023), both on independent held-out test splits. A theory-driven ablation shows that percentile-based calibration is…
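As a rough illustration of calibrate-then-fuse, the following sketch maps both score lists to percentile ranks and combines them linearly. It is a minimal sketch under stated assumptions, not PhaseGraph itself: the helper names, the mixing weight `alpha`, and the example scores are all hypothetical, and the sketch does not attempt to reproduce how the paper retains magnitude information or its Boltzmann weighting variant.

```python
import numpy as np

def percentile_rank(scores):
    """Empirical-CDF rank: fraction of candidates scoring <= each candidate."""
    s = np.asarray(scores, dtype=float)
    return np.searchsorted(np.sort(s), s, side="right") / s.size

def calibrated_linear_fusion(vector_scores, graph_scores, alpha=0.5):
    """Fuse two score lists after mapping each to a common unit-free scale.

    alpha is a hypothetical mixing weight (not taken from the paper).
    """
    v = percentile_rank(vector_scores)
    g = percentile_rank(graph_scores)
    return alpha * v + (1.0 - alpha) * g

def top_k(fused, k):
    """Indices of the k highest fused scores, best first."""
    return np.argsort(-np.asarray(fused), kind="stable")[:k]

# Hypothetical candidates: cosine-like similarities and PPR-like masses
# that live on very different numeric scales.
vector_scores = np.array([0.82, 0.79, 0.40, 0.75])
graph_scores = np.array([1e-4, 3e-2, 5e-5, 2e-3])

fused = calibrated_linear_fusion(vector_scores, graph_scores)
best = top_k(fused, 2)
```

Adding the raw scores directly would let the vector similarities dominate simply because they are numerically larger; after calibration both signals contribute on the same [0, 1] scale.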
