ProbeLogits: Kernel-Level LLM Inference Primitives for AI-Native Operating Systems
Daeyeon Son

TL;DR
ProbeLogits is a kernel-level primitive that classifies agent actions as safe or dangerous from a single forward pass by reading specific token logits, giving the OS a governance mechanism with zero learned parameters.
Contribution
This work presents ProbeLogits, a kernel-level operation that reads specific token logits inside the OS kernel, before any text is generated, and uses them to classify agent actions.
Findings
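Reaches a 97-99% block rate on HarmBench non-copyright (n=300) across all three base models when paired with the right verbalizer.
Matches or exceeds Llama Guard 3's F1 score on ToxicChat (n=1,000) in the same hosted environment.
Runs approximately 2.5x faster than existing classifiers in a hosted environment.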
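The two metrics quoted in these findings can be computed as in the minimal sketch below. It assumes that "block rate" means the fraction of harmful prompts the classifier blocks, and that F1 is the usual harmonic mean of precision and recall with the toxic class as positive. The function names are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def block_rate(blocked: Sequence[bool]) -> float:
    """Fraction of harmful prompts that were blocked (assumed definition)."""
    if not blocked:
        raise ValueError("need at least one decision")
    return sum(1 for b in blocked if b) / len(blocked)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 with label 1 (e.g. toxic) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, `block_rate([True, True, True, False])` is 0.75, and one true positive, one false positive, and one false negative give an F1 of 0.5.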
Abstract
An OS kernel that runs LLM inference internally can read logit distributions before any text is generated and act on them as a governance primitive. This paper presents ProbeLogits, a kernel-level operation that performs a single forward pass and reads specific token logits to classify agent actions as safe or dangerous, with zero learned parameters. I evaluate ProbeLogits across three base models (Qwen 2.5-7B, Llama 3 8B, Mistral 7B) on three external benchmarks: HarmBench, XSTest, and ToxicChat. On HarmBench non-copyright (n=300), all three models reach 97-99% block rate with the right verbalizer. On ToxicChat (n=1,000), ProbeLogits achieves F1 parity-or-better against Llama Guard 3 in the same hosted environment: the strongest configuration (Qwen 2.5-7B Safe/Dangerous, alpha=0.0) reaches F1=0.812 with bootstrap 95% CIs disjoint from LG3 (+13.7pp significant); Llama 3 S/D matches…
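The core idea in the abstract, reading two verbalizer-token logits after one forward pass and deciding from them, can be sketched as follows. This is an illustration under stated assumptions, not the paper's kernel implementation: the forward pass is replaced by a plain mapping from token string to logit, the function name `probe_decision` and the `threshold` parameter are hypothetical, and the `alpha` setting mentioned in the abstract is not modeled because its definition is not given in this excerpt.

```python
import math
from typing import Mapping, Tuple


def probe_decision(
    logits: Mapping[str, float],
    safe_token: str = "Safe",
    dangerous_token: str = "Dangerous",
    threshold: float = 0.5,  # hypothetical cutoff, not from the paper
) -> Tuple[str, float]:
    """Return ("block" or "allow", P(dangerous)) from two verbalizer logits.

    `logits` stands in for the next-token logits of a single forward pass.
    No parameters are learned: only the two logits are compared.
    """
    gap = logits[dangerous_token] - logits[safe_token]
    # A softmax restricted to the two verbalizer tokens equals a sigmoid of
    # the logit gap; both branches avoid overflow in math.exp.
    if gap >= 0:
        p_dangerous = 1.0 / (1.0 + math.exp(-gap))
    else:
        e = math.exp(gap)
        p_dangerous = e / (1.0 + e)
    decision = "block" if p_dangerous >= threshold else "allow"
    return decision, p_dangerous
```

Calling `probe_decision({"Safe": 2.0, "Dangerous": 0.0})` yields an "allow" decision, since the dangerous token carries the lower logit. Because the decision depends only on the logit gap, it can be made before any text is generated.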
