From Black-Box to White-Box: Control-Theoretic Neural Network Interpretability

Jihoon Moon

arXiv:2511.12852 · cs.LG · November 18, 2025

TL;DR

This paper introduces a control-theoretic framework for interpreting neural networks by modeling them as nonlinear state-space systems, which enables analysis of neuron importance and internal dynamics.

Contribution

It applies control-theoretic concepts (local linearization, controllability and observability Gramians, and Hankel singular values) to neural networks, providing a principled way to analyze the roles of individual neurons and internal modes.

Findings

01. Controllability measures how easily each neuron can be excited by input perturbations.
02. Observability measures how strongly each neuron influences the network output.
03. Hankel singular values rank internal modes by their energy.
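The three findings rest on standard linear-systems definitions. For reference, here they are for a stable discrete-time linear model with state perturbation δh (hidden activations), input perturbation δx, and output perturbation δy. This is the textbook form; how the paper assigns a network's layers to A, B, and C is not stated in this summary, so that mapping is an assumption.

```latex
\delta h_{k+1} = A\,\delta h_k + B\,\delta x_k, \qquad \delta y_k = C\,\delta h_k

W_c = \sum_{k=0}^{\infty} A^{k} B B^{\top} (A^{\top})^{k},
\qquad A W_c A^{\top} - W_c + B B^{\top} = 0

W_o = \sum_{k=0}^{\infty} (A^{\top})^{k} C^{\top} C A^{k},
\qquad A^{\top} W_o A - W_o + C^{\top} C = 0

\sigma_i = \sqrt{\lambda_i(W_c W_o)}, \qquad \sigma_1 \ge \sigma_2 \ge \cdots \ge 0
```

Reading the diagonal entries as per-neuron scores is one natural choice (assumed here): a large (W_c)_ii means input perturbations reach neuron i easily, a large (W_o)_ii means neuron i strongly affects the output, and σ_i measures the input-output energy carried by the i-th balanced mode. For a purely feedforward local model with no state-to-state term (A = 0), only the k = 0 terms survive, so W_c = B Bᵀ and W_o = Cᵀ C.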

Abstract

Deep neural networks achieve state-of-the-art performance but remain difficult to interpret mechanistically. In this work, we propose a control-theoretic framework that treats a trained neural network as a nonlinear state-space system and uses local linearization, controllability and observability Gramians, and Hankel singular values to analyze its internal computation. For a given input, we linearize the network around the corresponding hidden activation pattern and construct a state-space model whose state consists of the hidden neuron activations. The input-to-state and state-to-output Jacobians define local controllability and observability Gramians, from which we compute Hankel singular values and the associated modes. These quantities provide a principled notion of neuron and pathway importance: controllability measures how easily each neuron can be excited by input perturbations, observability…
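The pipeline in the abstract can be sketched for the smallest case. This is a minimal NumPy sketch, not the author's code, assuming a hypothetical one-hidden-layer network h = tanh(W1 x + b1), y = W2 h + b2 whose hidden activations h form the state. Under that assumption the local model is static (no state-to-state matrix), so the Gramians reduce to Wc = B Bᵀ and Wo = Cᵀ C, with B = ∂h/∂x and C = ∂y/∂h evaluated at the input x0. The function names `tanh_mlp`, `local_state_space`, and `gramian_analysis`, and the choice of Gramian diagonals as per-neuron scores, are illustrative; the paper's treatment of deeper networks and its exact importance scores may differ.

```python
import numpy as np


def tanh_mlp(x, W1, b1, W2, b2):
    """Hypothetical one-hidden-layer network: h = tanh(W1 x + b1), y = W2 h + b2."""
    h = np.tanh(W1 @ x + b1)
    return h, W2 @ h + b2


def local_state_space(x0, W1, b1, W2, b2):
    """Linearize around x0; return B = dh/dx (input-to-state) and C = dy/dh (state-to-output)."""
    h0, _ = tanh_mlp(x0, W1, b1, W2, b2)
    # tanh'(z) = 1 - tanh(z)^2, evaluated at the hidden activation pattern of x0
    B = (1.0 - h0 ** 2)[:, None] * W1  # shape (m, n)
    C = W2                             # shape (p, m); the output layer is affine
    return B, C


def gramian_analysis(B, C, tol=1e-12):
    """Local Gramians, per-neuron scores, Hankel singular values, and balanced modes."""
    Wc = B @ B.T  # controllability Gramian (m x m) of the static model dh = B dx
    Wo = C.T @ C  # observability Gramian (m x m) of the static model dy = C dh
    # Square-root method: Wc = B B^T and Wo = C^T C, so the Hankel singular values
    # are the singular values of C @ B (NumPy returns them in descending order).
    _, hsv, Vt = np.linalg.svd(C @ B, full_matrices=False)
    keep = hsv > tol
    # Balanced mode directions in neuron space: columns of B V Sigma^{-1/2}.
    modes = (B @ Vt[keep].T) / np.sqrt(hsv[keep])
    return {
        "Wc": Wc,
        "Wo": Wo,
        "controllability": np.diag(Wc),  # squared input-to-neuron gain, per neuron
        "observability": np.diag(Wo),    # squared neuron-to-output gain, per neuron
        "hsv": hsv,
        "modes": modes,
    }


# Usage on random weights (illustrative only, not a trained model)
rng = np.random.default_rng(0)
n, m, p = 3, 5, 2  # input, hidden (state), and output sizes
W1, b1 = rng.standard_normal((m, n)), rng.standard_normal(m)
W2, b2 = rng.standard_normal((p, m)), rng.standard_normal(p)
x0 = rng.standard_normal(n)

B, C = local_state_space(x0, W1, b1, W2, b2)
res = gramian_analysis(B, C)
print("neuron controllability:", np.round(res["controllability"], 3))
print("neuron observability:  ", np.round(res["observability"], 3))
print("Hankel singular values:", np.round(res["hsv"], 3))
print("neurons ranked by controllability:", np.argsort(-res["controllability"]))
```

Computing the Hankel singular values as the singular values of C @ B (the square-root method with Gramian factors B and Cᵀ) avoids diagonalizing the non-symmetric product Wc @ Wo, whose nonzero eigenvalues are the squared Hankel singular values.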

Taxonomy

Topics: Model Reduction and Neural Networks · Adversarial Robustness in Machine Learning · Neural Networks and Reservoir Computing