Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models
Nay Myat Min, Long H. Pham, Jun Sun

TL;DR
This paper introduces Layerwise Convergence Fingerprinting (LCF), a tuning-free runtime monitor that detects various LLM misbehaviors by analyzing inter-layer hidden states, achieving high detection rates with minimal false positives.
Contribution
LCF is a novel, model-agnostic runtime safety layer that detects backdoors, jailbreaks, and prompt injections without retraining or a reference model.
Findings
LCF reduces the backdoor attack success rate to below 1% on several models.
It detects 92-100% of jailbreaks across multiple techniques.
It flags 100% of prompt injections with a low false-positive rate.
Abstract
Large language models deployed at runtime can misbehave in ways that clean-data validation cannot anticipate: training-time backdoors lie dormant until triggered, jailbreaks subvert safety alignment, and prompt injections override the deployer's instructions. Existing runtime defenses address these threats one at a time and often assume a clean reference model, trigger knowledge, or editable weights, assumptions that rarely hold for opaque third-party artifacts. We introduce Layerwise Convergence Fingerprinting (LCF), a tuning-free runtime monitor that treats the inter-layer hidden-state trajectory as a health signal: LCF computes a diagonal Mahalanobis distance on every inter-layer difference, aggregates via Ledoit-Wolf shrinkage, and thresholds via leave-one-out calibration on 200 clean examples, with no reference model, trigger knowledge, or retraining. Evaluated on four…
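The pipeline the abstract describes (per-layer diagonal Mahalanobis distance, shrinkage-based aggregation, leave-one-out thresholding) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the hidden states are synthetic, the function names (fit, layer_scores, score, loo_threshold) are hypothetical, the aggregation step is one plausible reading of the abstract, and a fixed shrinkage intensity alpha stands in for the data-driven Ledoit-Wolf estimate (scikit-learn's LedoitWolf estimator would be the natural substitute).

```python
import numpy as np

def layer_scores(deltas, mu, var):
    """Diagonal Mahalanobis distance of each inter-layer difference.
    deltas: (..., L, d); mu, var: (L, d). Returns (..., L)."""
    return np.sqrt((((deltas - mu) ** 2) / var).sum(axis=-1))

def fit(deltas, alpha=0.1, eps=1e-6):
    """Fit clean statistics from inter-layer differences of shape (n, L, d).
    ASSUMPTION: alpha is a fixed shrinkage intensity toward a scaled
    identity, used here in place of the Ledoit-Wolf estimate."""
    mu = deltas.mean(axis=0)                      # per-layer, per-dim mean
    var = deltas.var(axis=0) + eps                # per-layer, per-dim variance
    s = layer_scores(deltas, mu, var)             # (n, L) per-layer distances
    cov = np.cov(s, rowvar=False)                 # (L, L) covariance of scores
    target = np.trace(cov) / cov.shape[0] * np.eye(cov.shape[0])
    cov = (1 - alpha) * cov + alpha * target      # shrunk covariance
    return mu, var, s.mean(axis=0), np.linalg.inv(cov)

def score(deltas, model):
    """Aggregate the L per-layer distances of one example into a scalar."""
    mu, var, s_mean, cov_inv = model
    c = layer_scores(deltas, mu, var) - s_mean
    return float(np.sqrt(c @ cov_inv @ c))

def loo_threshold(clean, quantile=0.99):
    """Leave-one-out calibration: score each clean example against
    statistics fitted on the remaining ones, then take a quantile.
    ASSUMPTION: the 0.99 quantile is illustrative, not from the paper."""
    n = clean.shape[0]
    loo = np.empty(n)
    for i in range(n):
        rest = np.delete(clean, i, axis=0)
        loo[i] = score(clean[i], fit(rest))
    return float(np.quantile(loo, quantile)), loo

# Synthetic stand-in for hidden states: 200 clean examples,
# 7 layer outputs each, hidden size 16 (all values are assumptions).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 7, 16)).cumsum(axis=1)
clean = np.diff(hidden, axis=1)                   # (200, 6, 16) differences

threshold, loo = loo_threshold(clean)
model = fit(clean)

# Simulate a misbehaving input by shifting one layer's update.
bad = clean[0].copy()
bad[3] += 5.0
flagged = score(bad, model) > threshold
```

In a real deployment the clean array would come from a model's per-layer hidden states on the 200 clean calibration examples the abstract mentions, and an input would be flagged at runtime whenever its score exceeds the calibrated threshold.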
