Rethinking LLMOps for Fraud and AML: Building a Compliance-Grade LLM Serving Stack
Prathamesh Vasudeo Naik, Naresh Dintakurthi, Yue Wang

TL;DR
This paper presents a specialized LLMOps stack optimized for fraud and AML compliance workloads, improving efficiency, latency, and throughput through workload-aware tuning and system design.
Contribution
It introduces a novel workload-aware LLM serving architecture tailored for compliance tasks, combining multiple optimization techniques and quality gates.
Findings
Throughput increased from 612-650 to 3,600 requests/hour.
P99 latency dropped from 31-38 seconds to 6.4-8.7 seconds.
GPU utilization improved from 12% to 78%.
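To put the three findings on a common scale, the short sketch below converts them into improvement factors. The listing does not say which baseline value pairs with which optimized value, so the helper (a hypothetical name, `ratio_range`) reports only the lower and upper bounds that the quoted ranges allow.

```python
def ratio_range(numerator, denominator):
    """Bounds on numerator/denominator when each is given as a (low, high) range."""
    num_lo, num_hi = numerator
    den_lo, den_hi = denominator
    # Smallest ratio: smallest numerator over largest denominator, and vice versa.
    return num_lo / den_hi, num_hi / den_lo


# Figures copied from the Findings list above.
throughput_gain = ratio_range((3600, 3600), (612, 650))  # after / before
latency_gain = ratio_range((31, 38), (6.4, 8.7))         # before / after
gpu_gain = ratio_range((78, 78), (12, 12))               # after / before

for name, (lo, hi) in [
    ("throughput", throughput_gain),
    ("P99 latency", latency_gain),
    ("GPU utilization", gpu_gain),
]:
    print(f"{name}: {lo:.2f}x to {hi:.2f}x")
```

Under these assumptions, throughput improves by roughly 5.5x to 5.9x, P99 latency by roughly 3.6x to 5.9x, and GPU utilization by 6.5x.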
Abstract
Fraud detection and anti-money-laundering (AML) compliance are high-value domains for large language models (LLMs), but their serving requirements differ sharply from generic chat workloads. Compliance prompts are often prefix-heavy, schema-constrained, and evidence-rich, combining reusable policy instructions, risk taxonomies, transaction or document context, and short structured outputs such as JSON labels or risk factors. These properties make prefix reuse, KV-cache efficiency, runtime tuning, model orchestration, and output validation first-order systems concerns. This paper introduces a workload-aware LLMOps stack for fraud and AML workloads using self-hosted open-weight models such as Meta Llama and Alibaba Qwen. The stack combines vLLM-style runtime tuning, PagedAttention, Automatic Prefix Caching, multi-adapter serving, adapter and prompt-length-aware batching, sleep/wake…
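The abstract's point about prefix-heavy prompts can be illustrated with a toy model of block-level prefix reuse. This is a simplified sketch, not the paper's implementation and not any runtime's actual code: the block size, the placeholder token names, and the `PrefixCache` class are all assumptions made for illustration. Each block's hash is chained to the hash of the block before it, so a block counts as reusable only when the entire preceding prefix is identical.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block (assumed; real runtimes use larger blocks)


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash every full block, chaining in the previous block's hash."""
    hashes, parent = [], ""
    full_length = len(tokens) - len(tokens) % block_size  # ignore a partial tail block
    for start in range(0, full_length, block_size):
        block = " ".join(tokens[start:start + block_size])
        parent = hashlib.sha256(f"{parent}|{block}".encode()).hexdigest()
        hashes.append(parent)
    return hashes


class PrefixCache:
    """Toy cache that only tracks which block hashes have been seen."""

    def __init__(self):
        self.seen = set()

    def admit(self, tokens):
        """Return (reused_blocks, total_blocks) and record the request's blocks."""
        hashes = block_hashes(tokens)
        reused = 0
        for h in hashes:
            if h not in self.seen:
                break  # chained hashes: after the first miss, later blocks miss too
            reused += 1
        self.seen.update(hashes)
        return reused, len(hashes)


# A shared 12-token policy prefix followed by 4 case-specific tokens per request.
policy = [f"policy_{i}" for i in range(12)]
case_a = [f"txn_a_{i}" for i in range(4)]
case_b = [f"txn_b_{i}" for i in range(4)]

cache = PrefixCache()
first = cache.admit(policy + case_a)
second = cache.admit(policy + case_b)
print(first, second)
```

In this toy setup the second request reuses the three policy blocks and computes only its one case-specific block, which is the effect that makes prefix caching attractive when policy instructions and risk taxonomies dominate the prompt.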
