LLM-42: Enabling Determinism in LLM Inference with Verified Speculation

Raja Gond, Aditya K Kamath, Ramachandran Ramjee, and Ashish Panwar

arXiv:2601.17768·cs.LG·February 2, 2026

TL;DR

LLM-42 introduces a scheduling-based method that achieves deterministic inference in large language models by verifying token outputs with minimal overhead, maintaining high throughput and flexibility.

Contribution

This work presents LLM-42, a novel approach that enforces determinism in LLM inference through verification and rollback, avoiding kernel redesigns and reducing overhead.

Findings

01. Achieves deterministic outputs with minimal performance impact.

02. Mostly reuses existing GPU kernels without modification.

03. Overhead is proportional to the amount of traffic requiring determinism.

Abstract

In LLM inference, the same prompt may yield different outputs across different runs. At the system level, this non-determinism arises from floating-point non-associativity combined with dynamic batching and GPU kernels whose reduction orders vary with batch size. A straightforward way to eliminate non-determinism is to disable dynamic batching during inference, but doing so severely degrades throughput. Another approach is to make kernels batch-invariant; however, this tightly couples determinism to kernel design, requiring new implementations. This coupling also imposes fixed runtime overheads, regardless of how much of the workload actually requires determinism. Inspired by ideas from speculative decoding, we present LLM-42, a scheduling-based approach to enable determinism in LLM inference. Our key observation is that if a sequence is in a consistent state, the next emitted token…
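
The abstract attributes non-determinism to floating-point non-associativity combined with reduction orders that change with batch size. The short Python sketch below illustrates only that numerical effect; it is not code from LLM-42. The helper names `sequential_sum` and `chunked_sum`, the chunk sizes, and the input values are all hypothetical, chosen so that the rounding difference is easy to see.

```python
# Illustrative sketch only; not code from LLM-42.
# All names and values here are assumptions made for the example.

def sequential_sum(values):
    """Add floats strictly left to right using an explicit loop."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, chunk_size):
    """Reduce each chunk left to right, then reduce the partial sums.

    Changing chunk_size changes the order in which additions happen,
    loosely mimicking a kernel whose reduction order depends on how
    the work is split.
    """
    partials = [
        sequential_sum(values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]
    return sequential_sum(partials)


# Floating-point addition is not associative.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# Hypothetical inputs: same numbers, different grouping, different result.
values = [1e16, 1.0, -1e16, 1.0]
print(chunked_sum(values, 4))  # 1.0
print(chunked_sum(values, 2))  # 0.0
```

The explicit loop is deliberate: the built-in `sum` in some recent Python versions applies compensated summation to floats, which would mask the effect being shown.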

Taxonomy

Topics: Numerical Methods and Algorithms · Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies