Accelerating PayPal's Commerce Agent with Speculative Decoding: An Empirical Study on EAGLE3 with Fine-Tuned Nemotron Models
Ally Qin, Jian Wan, Sarat Mudunuri, Srinivasan Manoharan

TL;DR
This study shows that speculative decoding with EAGLE3 improves inference efficiency for PayPal's Commerce Agent: 22-49% higher throughput at gamma=3 on the same hardware, and up to 50% GPU cost savings with no loss in output quality.
Contribution
It provides an empirical evaluation of EAGLE3 speculative decoding with fine-tuned Nemotron models, served via vLLM and benchmarked against NVIDIA NIM, showing throughput and latency improvements along with reduced hardware cost.
Findings
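gamma=3 (three speculative tokens) yields a 22-49% throughput improvement and an 18-33% latency reduction.
Acceptance rate remains stable at ~35.5% for gamma=3 across all tested conditions; gamma=5 drops to ~25%.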
A single H100 with speculative decoding matches or exceeds the performance of NIM on two H100s.
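The acceptance and cost figures above can be related with a little arithmetic. The Python sketch below is illustrative only and rests on two assumptions not stated in this summary: that the acceptance rate means accepted draft tokens divided by proposed draft tokens, and that GPU cost scales linearly with GPU count. The function names are hypothetical and are not taken from the paper or from vLLM.

```python
def accepted_per_step(gamma: int, acceptance_rate: float) -> float:
    """Mean number of draft tokens accepted per verification step.

    Assumption: acceptance_rate = accepted draft tokens / proposed draft tokens.
    """
    return gamma * acceptance_rate


def tokens_per_step(gamma: int, acceptance_rate: float) -> float:
    """Mean tokens emitted per target-model forward pass.

    The +1 is the token the target model itself contributes on every
    verification step (a correction or a bonus token).
    """
    return accepted_per_step(gamma, acceptance_rate) + 1.0


def gpu_cost_saving(baseline_gpus: int, new_gpus: int) -> float:
    """Fractional GPU cost saving, assuming cost is linear in GPU count."""
    return 1.0 - new_gpus / baseline_gpus


if __name__ == "__main__":
    # Acceptance rates as reported in the findings (approximate values).
    for gamma, rate in [(3, 0.355), (5, 0.25)]:
        print(f"gamma={gamma}: {tokens_per_step(gamma, rate):.3f} tokens per target pass")

    # Two H100s for the NIM baseline versus one H100 with speculative decoding.
    print(f"GPU cost saving: {gpu_cost_saving(2, 1):.0%}")
```

Under these assumptions, gamma=3 emits about 2.07 tokens per target pass and gamma=5 about 2.25. The larger per-step yield at gamma=5 comes with more drafting and verification work per step, which is consistent with the diminishing returns reported, although this sketch does not model that cost and cannot reproduce the measured throughput numbers.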
Abstract
We evaluate speculative decoding with EAGLE3 as an inference-time optimization for PayPal's Commerce Agent, powered by a fine-tuned llama3.1-nemotron-nano-8B-v1 model. Building on prior work (NEMO-4-PAYPAL) that reduced latency and cost through domain-specific fine-tuning, we benchmark EAGLE3 via vLLM against NVIDIA NIM on identical 2xH100 hardware across 40 configurations spanning speculative token counts (gamma=3, gamma=5), concurrency levels (1-32), and sampling temperatures (0, 0.5). Key findings: (1) gamma=3 achieves 22-49% throughput improvement and 18-33% latency reduction at zero additional hardware cost; (2) acceptance rates remain stable at approximately 35.5% for gamma=3 across all conditions; (3) gamma=5 yields diminishing returns (approximately 25% acceptance rate); (4) LLM-as-Judge evaluation confirms fully preserved output quality; and (5) speculative decoding on a single…
