Collie: Finding Performance Anomalies in RDMA Subsystems
Xinhao Kong, Yibo Zhu, Huaping Zhou, Zhuo Jiang, Jianxi Ye, Chuanxiong Guo, and Danyang Zhuo

TL;DR
Collie is a tool that systematically uncovers performance anomalies in RDMA subsystems. It searches the space of application workloads with simulated annealing, driving performance and diagnostic counters toward extreme values, to help administrators judge whether RDMA is ready for production.
Contribution
This paper introduces Collie, a holistic approach that uses simulated annealing to find RDMA performance anomalies without requiring access to hardware internal designs. It revealed new anomalies that the vendors acknowledged.
Findings
Discovered 15 new performance anomalies in RDMA systems.
7 anomalies were fixed after reporting; all were acknowledged by vendors.
Validated Collie's effectiveness on RDMA-based applications like RPC and machine learning.
Abstract
High-speed RDMA networks are getting rapidly adopted in the industry for their low latency and reduced CPU overheads. To verify that RDMA can be used in production, system administrators need to understand the set of application workloads that can potentially trigger abnormal performance behaviors (e.g., unexpected low throughput, PFC pause frame storm). We design and implement Collie, a tool for users to systematically uncover performance anomalies in RDMA subsystems without the need to access hardware internal designs. Instead of individually testing each hardware device (e.g., NIC, memory, PCIe), Collie is holistic, constructing a comprehensive search space for application workloads. Collie then uses simulated annealing to drive RDMA-related performance and diagnostic counters to extreme value regions to find workloads that can trigger performance anomalies. We evaluate Collie on…
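The search strategy described in the abstract can be illustrated with a minimal sketch of simulated annealing over a small workload space. This is not Collie's implementation: the parameter names (`num_qps`, `msg_size`, `batch_size`), their value ranges, and the synthetic `measure_counter` function are all assumptions made for illustration. In a real setting, `measure_counter` would run the workload on the hardware and read a performance or diagnostic counter.

```python
import math
import random

# Hypothetical workload dimensions; not Collie's actual search space.
SEARCH_SPACE = {
    "num_qps": [1, 8, 64, 512],
    "msg_size": [64, 1024, 65536, 1048576],
    "batch_size": [1, 4, 16, 64],
}


def measure_counter(workload):
    """Stand-in for running a workload and reading a diagnostic counter.

    The formula is synthetic and only gives the search something to climb.
    """
    return workload["num_qps"] * workload["batch_size"] / math.log2(workload["msg_size"])


def mutate(workload, rng):
    """Return a neighbor: re-draw one randomly chosen dimension."""
    neighbor = dict(workload)
    key = rng.choice(sorted(SEARCH_SPACE))
    neighbor[key] = rng.choice(SEARCH_SPACE[key])
    return neighbor


def anneal(steps=500, t0=100.0, cooling=0.98, seed=0):
    """Drive the counter toward high values with simulated annealing."""
    rng = random.Random(seed)
    current = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
    current_score = measure_counter(current)
    best, best_score = current, current_score
    temp = t0
    for _ in range(steps):
        candidate = mutate(current, rng)
        score = measure_counter(candidate)
        delta = score - current_score
        # Always accept improvements; accept regressions with a
        # probability that shrinks as the temperature cools.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, score
            if score > best_score:
                best, best_score = candidate, score
        temp = max(temp * cooling, 1e-9)
    return best, best_score


if __name__ == "__main__":
    workload, value = anneal()
    print(workload, value)
```

Accepting occasional regressions early on lets the search escape local optima, which matters when the counter's response to workload parameters is irregular. To search for extreme low values instead, the same loop can be run on the negated counter.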
Taxonomy
Topics: Software System Performance and Reliability · Semiconductor materials and devices · Cloud Computing and Resource Management
