Enhancing OLAP Resilience at LinkedIn
Praveen Chaganlal, Jia Guo, Vivek Vaidyanathan, Dino Occhialini, Sonam Mandal, Subbu Subramaniam, Siddharth Teotia, Tianqi Li, Xiaxuan Gao, Florence Zhang

TL;DR
This paper presents a comprehensive set of resiliency mechanisms for OLAP datastores, specifically Apache Pinot at LinkedIn, to ensure stable query latency and high availability during failures and load changes.
Contribution
It introduces novel resiliency techniques including workload isolation, impact-free rebalancing, fault-aware replica placement, and adaptive server selection for OLAP systems.
Findings
QWI delivers predictable tail latency and fairness with under 1% overhead.
Impact-Free Rebalancing keeps data movement SLA-safe during routine operations such as upgrades, scale-out, and recovery.
Resiliency framework successfully deployed in production at LinkedIn.
Abstract
Real-time OLAP datastores are critical infrastructure for modern enterprises, powering interactive analytics on petabyte-scale datasets with subsecond latency requirements. As these systems become integral to service architectures, maintaining strict SLAs under failures, load spikes, and cluster changes is as important as raw performance. We present a set of resiliency mechanisms developed for Apache Pinot at LinkedIn, applicable to modern OLAP systems broadly. We introduce Query Workload Isolation (QWI), which provides workload-level CPU and memory budgeting across Pinot's broker and server tiers via fine-grained resource accounting and sub-millisecond enforcement, delivering predictable tail latency and fairness with under 1% overhead. We present Impact-Free Rebalancing for SLA-safe data movement during routine operations (e.g., upgrades, scale-out, and recovery), and Maintenance…
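To make the idea of workload-level budgeting concrete, the following is a minimal sketch of per-workload CPU and memory accounting with an admission check. It is not Pinot's implementation or API: the class names (`WorkloadBudget`, `WorkloadAccountant`), the fixed enforcement window, and the reject-when-exhausted policy are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class WorkloadBudget:
    """Hypothetical budget for one workload within one enforcement window."""
    cpu_ns_limit: int
    mem_bytes_limit: int
    cpu_ns_used: int = 0
    mem_bytes_used: int = 0

    def charge(self, cpu_ns: int = 0, mem_bytes: int = 0) -> None:
        # Accumulate the resources consumed by queries of this workload.
        self.cpu_ns_used += cpu_ns
        self.mem_bytes_used += mem_bytes

    def exhausted(self) -> bool:
        # The budget is spent once either resource reaches its limit.
        return (self.cpu_ns_used >= self.cpu_ns_limit
                or self.mem_bytes_used >= self.mem_bytes_limit)

    def reset(self) -> None:
        self.cpu_ns_used = 0
        self.mem_bytes_used = 0


class WorkloadAccountant:
    """Hypothetical accountant that tracks each workload separately (assumed design)."""

    def __init__(self) -> None:
        self._budgets: Dict[str, WorkloadBudget] = {}

    def register(self, workload: str, cpu_ns_limit: int, mem_bytes_limit: int) -> None:
        self._budgets[workload] = WorkloadBudget(cpu_ns_limit, mem_bytes_limit)

    def admit(self, workload: str) -> bool:
        # Assumed policy: reject new work once the workload's budget is spent.
        return not self._budgets[workload].exhausted()

    def record(self, workload: str, cpu_ns: int = 0, mem_bytes: int = 0) -> None:
        self._budgets[workload].charge(cpu_ns, mem_bytes)

    def start_new_window(self) -> None:
        # Budgets are replenished at the start of every window.
        for budget in self._budgets.values():
            budget.reset()
```

Because each workload is charged against its own budget, a workload that overruns its limits is throttled without affecting the admission of others, which is the isolation property the abstract describes.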
Taxonomy
Topics: Cloud Computing and Resource Management · Software System Performance and Reliability · Advanced Database Systems and Queries
