Accelerating PayPal's Commerce Agent with Speculative Decoding: An Empirical Study on EAGLE3 with Fine-Tuned Nemotron Models

Ally Qin; Jian Wan; Sarat Mudunuri; Srinivasan Manoharan

arXiv:2604.19767 · cs.LG · April 23, 2026

TL;DR

This study shows that speculative decoding with EAGLE3 improves inference throughput and latency for PayPal's Commerce Agent without quality loss. Because a single H100 with speculative decoding matches or exceeds dual-H100 NIM performance, GPU cost savings reach up to 50%.

Contribution

The paper provides an empirical evaluation of speculative decoding with EAGLE3 on fine-tuned Nemotron models, showing substantial throughput and latency improvements at reduced hardware cost.

Findings

01 Gamma=3 yields 22-49% throughput improvement.

02 Acceptance rate remains stable at ~35.5% for gamma=3.

03 Single H100 with speculative decoding matches or exceeds dual H100 NIM performance.

Abstract

We evaluate speculative decoding with EAGLE3 as an inference-time optimization for PayPal's Commerce Agent, powered by a fine-tuned llama3.1-nemotron-nano-8B-v1 model. Building on prior work (NEMO-4-PAYPAL) that reduced latency and cost through domain-specific fine-tuning, we benchmark EAGLE3 via vLLM against NVIDIA NIM on identical 2xH100 hardware across 40 configurations spanning speculative token counts (gamma=3, gamma=5), concurrency levels (1-32), and sampling temperatures (0, 0.5). Key findings: (1) gamma=3 achieves 22-49% throughput improvement and 18-33% latency reduction at zero additional hardware cost; (2) acceptance rates remain stable at approximately 35.5% for gamma=3 across all conditions; (3) gamma=5 yields diminishing returns (approximately 25% acceptance rate); (4) LLM-as-Judge evaluation confirms fully preserved output quality; and (5) speculative decoding on a single…
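
The diminishing returns at gamma=5 can be illustrated with a little arithmetic on the reported acceptance rates. The sketch below is not the paper's method. It assumes that "acceptance rate" means accepted draft tokens divided by proposed draft tokens, and that each verification step also emits one token from the target model. The function names are hypothetical, and the model ignores draft-model cost, batching, and concurrency effects.

```python
def expected_tokens_per_step(gamma: int, acceptance_rate: float) -> float:
    """Mean tokens emitted per target-model verification step.

    Assumption (not from the paper): acceptance_rate is the fraction of
    proposed draft tokens that are accepted, and every step yields one
    additional token produced by the target model itself.
    """
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    return 1.0 + gamma * acceptance_rate


def rejected_drafts_per_step(gamma: int, acceptance_rate: float) -> float:
    """Mean draft tokens proposed but discarded per step (same assumption)."""
    return gamma * (1.0 - acceptance_rate)


if __name__ == "__main__":
    # Approximate acceptance rates reported in the abstract.
    for gamma, rate in [(3, 0.355), (5, 0.25)]:
        emitted = expected_tokens_per_step(gamma, rate)
        wasted = rejected_drafts_per_step(gamma, rate)
        print(f"gamma={gamma}: {emitted:.3f} tokens/step, "
              f"{wasted:.3f} rejected drafts/step")
```

Under these assumptions, gamma=3 gives about 2.07 tokens per step and gamma=5 about 2.25, while the rejected draft tokens per step rise from roughly 1.9 to 3.75. The extra two draft tokens buy less than 0.2 additional tokens per step, which is consistent with the diminishing returns the abstract reports for gamma=5.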
