From Black-Box to White-Box: Control-Theoretic Neural Network Interpretability
Jihoon Moon

TL;DR
This paper introduces a control-theoretic framework to interpret neural networks by modeling them as nonlinear state space systems, enabling analysis of neuron importance and internal dynamics for improved interpretability.
Contribution
It develops a novel method that applies control theory concepts to neural networks, providing a principled way to analyze neuron roles and internal modes.
Findings
Controllability measures how easily each neuron can be excited by input perturbations.
Observability assesses how strongly each neuron influences the output.
Hankel singular values rank internal modes by energy.
Abstract
Deep neural networks achieve state-of-the-art performance but remain difficult to interpret mechanistically. In this work, we propose a control-theoretic framework that treats a trained neural network as a nonlinear state space system and uses local linearization, controllability and observability Gramians, and Hankel singular values to analyze its internal computation. For a given input, we linearize the network around the corresponding hidden activation pattern and construct a state space model whose state consists of hidden neuron activations. The input-to-state and state-to-output Jacobians define local controllability and observability Gramians, from which we compute Hankel singular values and associated modes. These quantities provide a principled notion of neuron and pathway importance: controllability measures how easily each neuron can be excited by input perturbations, observability…
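As a rough illustration of the pipeline the abstract describes, the sketch below linearizes a toy network at one input and computes local Gramians and Hankel singular values. It is not the paper's implementation. The one-hidden-layer tanh network, the random weights, and the names W1, b1, W2, x0 are all hypothetical, and the one-step Gramian forms B Bᵀ and Cᵀ C are a simplifying assumption for a static feedforward map.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy network (an assumption, not from the paper):
#   h = tanh(W1 x + b1),  y = W2 h
n_in, n_hidden, n_out = 3, 5, 2
W1 = rng.standard_normal((n_hidden, n_in))
b1 = rng.standard_normal(n_hidden)
W2 = rng.standard_normal((n_out, n_hidden))

# Operating point: the hidden activation pattern for one input x0.
x0 = rng.standard_normal(n_in)
h0 = np.tanh(W1 @ x0 + b1)

# Input-to-state Jacobian B = dh/dx at x0 (tanh'(z) = 1 - tanh(z)^2).
B = (1.0 - h0**2)[:, None] * W1
# State-to-output Jacobian C = dy/dh (constant for a linear readout).
C = W2

# One-step local Gramians (assumed form for a static map).
Wc = B @ B.T   # controllability Gramian, n_hidden x n_hidden
Wo = C.T @ C   # observability Gramian,  n_hidden x n_hidden

# Per-neuron scores: diagonal entries of each Gramian.
controllability = np.diag(Wc)  # how strongly inputs excite each neuron
observability = np.diag(Wo)    # how strongly each neuron drives the output

# Hankel singular values: square roots of the eigenvalues of Wc @ Wo.
eigvals = np.linalg.eigvals(Wc @ Wo).real
hankel = np.sort(np.sqrt(np.clip(eigvals, 0.0, None)))[::-1]

print("controllability per neuron:", controllability)
print("observability per neuron:  ", observability)
print("Hankel singular values:    ", hankel)
```

Under these assumptions the nonzero Hankel singular values coincide with the singular values of the local input-output Jacobian C B, so at most min(n_in, n_out) modes carry energy in this toy case.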
Taxonomy
Topics: Model Reduction and Neural Networks · Adversarial Robustness in Machine Learning · Neural Networks and Reservoir Computing
