Operational Memory Architecture for Kubernetes: Preserving Causal Context Across the Evidence Horizon
Shamsher Khan

TL;DR
This paper presents the Operational Memory Architecture (OMA) for Kubernetes, which preserves critical causal failure evidence across event rotations, enabling better diagnosis during high-frequency crash loops.
Contribution
It introduces a novel architecture and open-source implementation that captures and reconstructs causal failure evidence in Kubernetes clusters, addressing the evidence horizon problem.
Findings
Causal edges built with mean latency below 1 ms
Collector processes ~2.8 events/sec with under 10 MB memory
Effective evidence preservation during crash loops
Abstract
Kubernetes clusters generate rich operational events during pod lifecycle transitions, yet the platform's native event retention model discards the most diagnostically valuable context. The LastTerminationState field, which records a container's last failure, is overwritten shortly after a pod restart. We define this as the evidence horizon. During high-frequency crash loops, this horizon may be crossed multiple times before inspection, permanently losing critical evidence. This paper introduces the Operational Memory Architecture (OMA) to preserve causal failure evidence before event rotation. OMA encodes evidence retention and causal reconstruction as explicit architectural requirements. It captures operational events into causal chains using three patterns: P001 (OOMKill chain), P002 (ConfigMap variable misconfiguration), and P003 (ConfigMap volume mount propagation). We…
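To make the evidence horizon concrete, the following is a minimal sketch of the retention idea only, not the paper's implementation. It assumes container statuses arrive as plain dictionaries shaped like the Kubernetes `containerStatuses` JSON (`name`, `restartCount`, `lastState.terminated`). The names `EvidenceStore`, `TerminationRecord`, `observe`, and `history` are hypothetical and chosen for illustration. Each observed termination is copied into a store keyed by restart count, so a later restart that overwrites the live field does not erase the earlier record.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class TerminationRecord:
    """Hypothetical snapshot of one container termination."""
    pod: str
    container: str
    restart_count: int
    reason: str
    exit_code: int
    finished_at: str


@dataclass
class EvidenceStore:
    """Hypothetical in-memory store; a real collector would persist this."""
    _records: Dict[Tuple[str, str, int], TerminationRecord] = field(default_factory=dict)

    def observe(self, pod: str, container_status: dict) -> Optional[TerminationRecord]:
        """Copy lastState.terminated out of a status before it can be overwritten.

        Returns the new record, or None if there is nothing new to keep.
        """
        terminated = (container_status.get("lastState") or {}).get("terminated")
        if not terminated:
            return None
        key = (pod, container_status["name"], container_status.get("restartCount", 0))
        if key in self._records:
            # Same restart seen again: keep the first snapshot (idempotent).
            return None
        record = TerminationRecord(
            pod=pod,
            container=container_status["name"],
            restart_count=key[2],
            reason=terminated.get("reason", ""),
            exit_code=terminated.get("exitCode", -1),
            finished_at=terminated.get("finishedAt", ""),
        )
        self._records[key] = record
        return record

    def history(self, pod: str, container: str) -> List[TerminationRecord]:
        """All preserved terminations for one container, oldest restart first."""
        matches = [r for r in self._records.values()
                   if r.pod == pod and r.container == container]
        return sorted(matches, key=lambda r: r.restart_count)


# Two successive status snapshots of the same container during a crash loop.
# In the live API only the second lastState would remain visible.
store = EvidenceStore()
first = {"name": "app", "restartCount": 1,
         "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137,
                                      "finishedAt": "2024-01-01T00:00:00Z"}}}
second = {"name": "app", "restartCount": 2,
          "lastState": {"terminated": {"reason": "Error", "exitCode": 1,
                                       "finishedAt": "2024-01-01T00:01:00Z"}}}
store.observe("web-0", first)
store.observe("web-0", second)
reasons = [r.reason for r in store.history("web-0", "app")]
```

Keying on the restart count is one simple way to make repeated observations idempotent; how the actual collector deduplicates and links records into causal chains is not specified in the text above.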
