TL;DR
SpecHop is a framework that accelerates multi-hop retrieval tasks in language models by using continuous speculation with multiple threads, reducing latency while maintaining accuracy.
Contribution
It introduces a lossless speculation framework that asynchronously verifies predictions, enabling significant latency reduction without sacrificing correctness.
Findings
SpecHop reduces retrieval latency by up to 40%.
Its measured latency gains closely match those predicted by the theoretical framework.
Empirical results validate the effectiveness of SpecHop on multi-hop retrieval tasks.
Abstract
Large language models increasingly use external tools such as web search and document retrieval to solve information-intensive tasks. However, multi-hop tool use in complex tasks introduces substantial latency, since the model must repeatedly wait for tool observations before continuing. We study how to accelerate such trajectories without changing the final trajectory the model would have taken without acceleration, assuming access to faster but less reliable speculator tools. We develop a theoretical framework for lossless speculation in multi-hop tool-use settings, characterizing the optimal achievable latency gain. We propose SpecHop, a continuous speculation framework that maintains multiple speculative threads, verifies predicted observations asynchronously as target tool outputs arrive, commits correct branches, and rolls back incorrect ones. This preserves accuracy while…
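The speculate, verify, commit, and roll back loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `target_tool`, `speculator_tool`, and `next_query`, the delays, and the rule that makes the speculator wrong on the second hop are all hypothetical stand-ins. The sketch also simplifies in two ways: it follows a single speculative chain rather than multiple threads, and it verifies hops in order once the chain is built instead of reacting to each target output as it arrives.

```python
import asyncio


async def target_tool(query, delay=0.05):
    """Hypothetical slow, authoritative tool."""
    await asyncio.sleep(delay)
    return f"obs[{query}]"


async def speculator_tool(query, delay=0.005):
    """Hypothetical fast tool; deliberately wrong on the second hop."""
    await asyncio.sleep(delay)
    return "bad-guess" if query.startswith("q1") else f"obs[{query}]"


def next_query(history):
    """Stand-in for the model: the next query depends on the last observation."""
    return f"q{len(history)}<-{history[-1][1]}"


async def run_sequential(first_query, hops):
    """Baseline: wait for the target tool at every hop."""
    history, query = [], first_query
    for _ in range(hops):
        history.append((query, await target_tool(query)))
        query = next_query(history)
    return history


async def run_speculative(first_query, hops):
    """Speculate ahead on fast guesses, then commit or roll back per hop."""
    committed, query = [], first_query
    while len(committed) < hops:
        # Build a speculative chain on top of the committed prefix.
        pending, history, q = [], list(committed), query
        while len(history) < hops:
            task = asyncio.create_task(target_tool(q))  # runs in background
            guess = await speculator_tool(q)
            pending.append((q, guess, task))
            history.append((q, guess))
            q = next_query(history)
        # Verify hops in order against the target tool's outputs.
        for i, (q, guess, task) in enumerate(pending):
            truth = await task
            committed.append((q, truth))  # the target output is always kept
            if truth != guess:
                # Roll back: later hops were built on a wrong observation.
                for _, _, later in pending[i + 1:]:
                    later.cancel()
                break
        query = next_query(committed)
    return committed


if __name__ == "__main__":
    baseline = asyncio.run(run_sequential("q0", 4))
    speculative = asyncio.run(run_speculative("q0", 4))
    print(speculative == baseline)
```

Because only target-tool observations are ever committed, the speculative run returns the same trajectory as the sequential baseline, which is the sense in which the speculation is lossless. Any latency saving in this toy setup comes from target calls for later hops overlapping with earlier ones whenever the guesses are right.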
