On Misbehaviour and Fault Tolerance in Machine Learning Systems
Lalli Myllyaho, Mikko Raatikainen, Tomi Männistö, Jukka K. Nurminen, Tommi Mikkonen

TL;DR
This paper investigates fault tolerance in ML systems, highlighting common design patterns and emphasizing the need for mature engineering practices to improve reliability and security in adaptive ML applications.
Contribution
It provides a conceptual framework for understanding ML misbehaviour, identifies emerging fault-tolerant design patterns, and underscores the field's immaturity and need for further development.
Findings
Monitoring input data and outputs for fault detection
Using multiple models and fallback strategies for robustness
Design patterns are emerging but not yet widely adopted
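To make the first two findings concrete, here is a minimal Python sketch of a wrapper that checks inputs and outputs and switches to a fallback when a check fails. It is an illustration under stated assumptions, not a design taken from the paper: the class name GuardedModel, the simple range checks, and the stand-in models are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Model = Callable[[Sequence[float]], float]


@dataclass
class GuardedModel:
    """Hypothetical wrapper: monitor inputs and outputs, fall back on anomalies."""
    primary: Model                      # the ML component being guarded
    fallback: Model                     # simpler, trusted alternative (e.g. a rule)
    input_range: Tuple[float, float]    # assumed valid range for every feature
    output_range: Tuple[float, float]   # assumed plausible range for predictions

    @staticmethod
    def _in_range(value: float, bounds: Tuple[float, float]) -> bool:
        low, high = bounds
        # NaN fails this comparison, so it is treated as out of range.
        return low <= value <= high

    def predict(self, features: Sequence[float]) -> Tuple[float, str]:
        # Input monitoring: reject data outside the range the model was built for.
        if not all(self._in_range(x, self.input_range) for x in features):
            return self.fallback(features), "fallback:input"
        # Contain crashes of the ML component instead of propagating them.
        try:
            prediction = self.primary(features)
        except Exception:
            return self.fallback(features), "fallback:error"
        # Output monitoring: reject implausible predictions.
        if not self._in_range(prediction, self.output_range):
            return self.fallback(features), "fallback:output"
        return prediction, "primary"


def toy_model(features: Sequence[float]) -> float:
    """Stand-in for a trained model (assumption for the example)."""
    return 10.0 * sum(features)


def conservative_default(features: Sequence[float]) -> float:
    """Stand-in fallback that always returns a safe constant."""
    return 0.5


guard = GuardedModel(toy_model, conservative_default, (0.0, 1.0), (0.0, 10.0))
print(guard.predict([0.25, 0.25]))  # in-range input and output
print(guard.predict([2.0, 0.1]))    # input outside the assumed range
print(guard.predict([0.75, 0.75]))  # output outside the assumed range
```

Returning the decision source alongside the value lets the surrounding system log how often the fallback fires, which is itself a monitoring signal. Real systems would likely replace the range checks with distribution-drift or confidence checks.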
Abstract
Machine learning (ML) provides us with numerous opportunities, allowing ML systems to adapt to new situations and contexts. At the same time, this adaptability raises uncertainties concerning the run-time product quality or dependability, such as reliability and security, of these systems. Systems can be tested and monitored, but this does not provide protection against faults and failures in adapted ML systems themselves. We studied software designs that aim at introducing fault tolerance in ML systems so that possible problems in ML components of the systems can be avoided. The research was conducted as a case study, and its data was collected through five semi-structured interviews with experienced software architects. We present a conceptualisation of the misbehaviour of ML systems, the perceived role of fault tolerance, and the designs used. Common patterns to incorporating ML…
Taxonomy
Topics: Software Engineering Research · Software System Performance and Reliability · Distributed systems and fault tolerance
