Towards Accountable AI: Hybrid Human-Machine Analyses for Characterizing System Failure
Besmira Nushi, Ece Kamar, Eric Horvitz

TL;DR
This paper introduces Pandora, a hybrid human-machine approach for detailed analysis and explanation of system failures in complex AI systems, improving understanding beyond traditional aggregate metrics.
Contribution
Pandora provides a novel set of tools combining human insights and system data to characterize failures in multi-component machine learning systems.
Findings
Pandora enables detailed failure analysis in image captioning systems.
Case study shows Pandora improves debugging and system understanding.
Hybrid methods reveal failure conditions related to input and architecture.
Abstract
As machine learning systems move from computer-science laboratories into the open world, their accountability becomes a high-priority problem. Accountability requires deep understanding of system behavior and its failures. Current evaluation methods such as single-score error metrics and confusion matrices provide aggregate views of system performance that hide important shortcomings. Understanding details about failures is important for identifying pathways for refinement, communicating the reliability of systems in different settings, and specifying appropriate human oversight and engagement. Characterization of failures and shortcomings is particularly complex for systems composed of multiple machine-learned components. For such systems, existing evaluation methods have limited expressiveness in describing and explaining the relationship among input content, the internal states…
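The abstract's point that aggregate metrics can hide important shortcomings can be illustrated with a minimal sketch. This is not Pandora's method or code: the records, the condition names ("outdoor", "indoor"), and both helper functions are hypothetical, invented only to show how a single score can mask a failure concentrated in one input condition.

```python
from collections import defaultdict

# Hypothetical evaluation records: (input condition, was the output correct?).
# Conditions and outcomes are invented for illustration; they are not results
# from the paper.
records = [
    ("outdoor", True), ("outdoor", True), ("outdoor", True), ("outdoor", True),
    ("outdoor", True), ("outdoor", True), ("outdoor", True), ("outdoor", True),
    ("indoor", False), ("indoor", False),
]


def overall_accuracy(records):
    """Single-score metric: fraction of correct outputs over all records."""
    return sum(ok for _, ok in records) / len(records)


def accuracy_by_condition(records):
    """The same data, split by input condition."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for condition, ok in records:
        totals[condition] += 1
        correct[condition] += ok
    return {c: correct[c] / totals[c] for c in totals}


print(overall_accuracy(records))       # 0.8
print(accuracy_by_condition(records))  # {'outdoor': 1.0, 'indoor': 0.0}
```

In this toy data the overall score of 0.8 looks acceptable, while the per-condition view shows that every "indoor" input fails. A breakdown of this kind is only the simplest form of the failure characterization the abstract calls for.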
