CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding
Sophia Ho, Jinsol Park, Patrick Wang

TL;DR
CREST is a redesign of the REST datastore for retrieval-based speculative decoding that stores only a selected subset of n-grams, reducing storage needs while maintaining or improving drafting performance on the HumanEval and MT Bench benchmarks.
Contribution
CREST keeps only a subset of the smallest and most common n-grams in the datastore, making it more compact without reducing the accepted token length.
Findings
CREST matches REST's accepted token length with 10.6-13.5x less storage space.
With the same storage space, CREST achieves a 16.5-17.1% higher acceptance length than REST on the HumanEval and MT Bench benchmarks.
Storing only a subset of n-grams both reduces storage space and improves performance.
Abstract
We present CREST (Compact Retrieval-Based Speculative Decoding), a redesign of REST that allows it to be effectively "compacted". REST is a drafting technique for speculative decoding based on retrieving exact n-gram matches of the most recent n tokens generated by the target LLM from a datastore. The key idea of CREST is to only store a subset of the smallest and most common n-grams in the datastore with the hope of achieving comparable performance with less storage space. We found that storing a subset of n-grams both reduces storage space and improves performance. CREST matches REST's accepted token length with 10.6-13.5x less storage space and achieves a 16.5-17.1% higher acceptance length than REST using the same storage space on the HumanEval and MT Bench benchmarks.
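The compaction idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`build_compact_datastore`, `draft_next_token`), the parameters (`max_n`, `keep_per_n`), and the choice to rank n-grams by raw frequency and draft a single token are all assumptions made for illustration. The sketch keeps only short n-grams (up to `max_n` tokens) and, for each length, only the `keep_per_n` most frequent ones; drafting looks up the longest stored n-gram that exactly matches the most recent tokens.

```python
from collections import Counter, defaultdict


def build_compact_datastore(token_seqs, max_n=2, keep_per_n=2):
    """Map each kept n-gram (tuple of tokens) to a Counter of the tokens that followed it.

    Hypothetical helper: keeps only n-grams of length <= max_n, and for each
    length only the keep_per_n most frequent ones.
    """
    followers = defaultdict(Counter)
    for seq in token_seqs:
        for n in range(1, max_n + 1):
            for i in range(len(seq) - n):
                followers[tuple(seq[i:i + n])][seq[i + n]] += 1

    datastore = {}
    for n in range(1, max_n + 1):
        # Rank n-grams of this length by occurrence count (ties broken by token order).
        ranked = sorted(
            (g for g in followers if len(g) == n),
            key=lambda g: (-sum(followers[g].values()), g),
        )
        for gram in ranked[:keep_per_n]:
            datastore[gram] = followers[gram]
    return datastore


def draft_next_token(datastore, context, max_n=2):
    """Return the most frequent follower of the longest stored suffix of context, or None."""
    for n in range(min(max_n, len(context)), 0, -1):
        gram = tuple(context[-n:])
        if gram in datastore:
            return datastore[gram].most_common(1)[0][0]
    return None


# Toy corpus of token ids (made up for this example).
corpus = [[1, 2, 3, 1, 2, 4, 1, 2, 3]]
store = build_compact_datastore(corpus, max_n=2, keep_per_n=1)
print(draft_next_token(store, [9, 1, 2]))  # exact match on the 2-gram (1, 2)
print(draft_next_token(store, [7, 1]))     # falls back to the 1-gram (1,)
print(draft_next_token(store, [5, 6]))     # no stored n-gram matches
```

In this toy setting, lowering `keep_per_n` or `max_n` shrinks the datastore at the cost of fewer matches, which is the storage versus acceptance-length trade-off the abstract measures. The target LLM would still verify any drafted token.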
Taxonomy
Topics: Neural Networks and Applications
