Calibrated Fusion for Heterogeneous Graph-Vector Retrieval in Multi-Hop QA

Andre Bacellar

arXiv:2603.28886 · cs.IR · April 29, 2026

TL;DR

This paper introduces PhaseGraph, a score calibration method that maps vector-similarity and graph-based retrieval scores to a common scale before fusing them, improving last-hop retrieval for multi-hop question answering on MuSiQue and 2WikiMultiHopQA.

Contribution

The paper proposes percentile-rank normalization (PIT) as a calibration step that enables stable fusion of vector and graph scores in multi-hop QA retrieval.

Findings

01

Calibrated fusion improves held-out LastHop@5 from 75.1% to 76.5% on MuSiQue and from 51.7% to 53.6% on 2WikiMultiHopQA.

02

Percentile-rank normalization is more robust than min-max normalization.

03

Boltzmann weighting performs comparably to linear fusion after calibration.
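
To make the second finding concrete, here is a minimal sketch contrasting the two normalizations on a heavy-tailed score list. It is an illustration, not the paper's implementation: the function names, the tie handling (fraction of candidates scoring at or below a value), and the toy PPR-like scores are all assumptions.

```python
import numpy as np

def min_max(scores):
    """Rescale scores to [0, 1] using the observed minimum and maximum."""
    s = np.asarray(scores, dtype=float)
    span = s.max() - s.min()
    return np.zeros_like(s) if span == 0 else (s - s.min()) / span

def percentile_rank(scores):
    """Map each score to the fraction of candidates scoring at or below it."""
    s = np.asarray(scores, dtype=float)
    # searchsorted with side="right" counts the elements <= each score
    return np.searchsorted(np.sort(s), s, side="right") / s.size

# Hypothetical PPR-like scores: one dominant node and a long tail
ppr = [0.50, 0.004, 0.003, 0.002, 0.001]
mm = min_max(ppr)          # tail values are squeezed toward 0 by the outlier
pr = percentile_rank(ppr)  # tail values stay evenly spread over (0, 1]
print(np.round(mm, 4))
print(pr)
```

In this toy case a single large score compresses every other min-max value below 0.01, while the percentile ranks depend only on ordering and are unaffected by how large the outlier is.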

Abstract

Graph-augmented retrieval combines dense similarity with graph-based relevance signals such as Personalized PageRank (PPR), but these scores have different distributions and are not directly comparable. We study this as a score calibration problem for heterogeneous retrieval fusion in multi-hop question answering. Our method, PhaseGraph, maps vector and graph scores to a common unit-free scale using percentile-rank normalization (PIT) before fusion, enabling stable combination without discarding magnitude information. Across MuSiQue and 2WikiMultiHopQA, calibrated fusion improves held-out last-hop retrieval on HippoRAG2-style benchmarks: LastHop@5 increases from 75.1% to 76.5% on MuSiQue (8W/1L, p=0.039) and from 51.7% to 53.6% on 2WikiMultiHopQA (11W/2L, p=0.023), both on independent held-out test splits. A theory-driven ablation shows that percentile-based calibration is…
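
The calibrate-then-fuse idea described in the abstract can be sketched as follows. This is one plausible reading under stated assumptions rather than the PhaseGraph implementation: the empirical percentile rank stands in for the PIT step, the fusion is a simple convex combination, and `calibrated_fusion`, `top_k`, the weight `alpha`, and the toy scores are hypothetical.

```python
import numpy as np

def percentile_rank(scores):
    """Empirical percentile rank: fraction of candidates scoring at or below each value."""
    s = np.asarray(scores, dtype=float)
    return np.searchsorted(np.sort(s), s, side="right") / s.size

def calibrated_fusion(vec_scores, graph_scores, alpha=0.5):
    """Linearly fuse two score lists after mapping both to percentile ranks.

    alpha is a hypothetical mixing weight (1.0 = vector only, 0.0 = graph only).
    """
    v = percentile_rank(vec_scores)
    g = percentile_rank(graph_scores)
    return alpha * v + (1.0 - alpha) * g

def top_k(scores, k):
    """Indices of the k highest fused scores, ties broken by original order."""
    return np.argsort(-np.asarray(scores), kind="stable")[:k]

# Toy candidate set: cosine similarities and PPR scores live on different scales
cosine = [0.82, 0.79, 0.75, 0.40]
ppr = [0.001, 0.030, 0.002, 0.200]

fused = calibrated_fusion(cosine, ppr)
print(fused)            # fused scores on a common [0, 1] scale
print(top_k(fused, 2))  # candidate indices ranked by fused score
```

Without calibration, a raw weighted sum of these inputs would be dominated by the cosine scores simply because they are numerically larger; after percentile ranking, both signals contribute on the same unit-free scale.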
