Understanding Task Representations in Neural Networks via Bayesian Ablation

Andrew Nam; Declan Campbell; Thomas Griffiths; Jonathan Cohen; Sarah-Jane Leslie

arXiv:2505.13742 · cs.LG · April 7, 2026

TL;DR

This paper introduces a Bayesian-inspired probabilistic framework for interpreting neural network representations. It provides tools for inferring the causal role of individual representational units and for measuring properties such as distributedness, manifold complexity, and polysemanticity.

Contribution

The paper presents a probabilistic approach for interpreting latent task representations in neural networks, which are hard to read directly because their semantics are sub-symbolic.

Findings

01. Provides a suite of metrics for analyzing neural representations.

02. Defines a distribution over units to infer their causal contributions.

03. Illuminates properties like distributedness and manifold complexity.

Abstract

Neural networks are powerful tools for cognitive modeling due to their flexibility and emergent properties. However, interpreting their learned representations remains challenging due to their sub-symbolic semantics. In this work, we introduce a novel probabilistic framework for interpreting latent task representations in neural networks. Inspired by Bayesian inference, our approach defines a distribution over representational units to infer their causal contributions to task performance. Using ideas from information theory, we propose a suite of tools and metrics to illuminate key model properties, including representational distributedness, manifold complexity, and polysemanticity.
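To make the idea of a distribution over representational units concrete, here is a minimal sketch in Python. It is not the authors' algorithm, which the abstract does not specify. It assumes a Bernoulli(0.5) prior over keep/ablate masks, a pseudo-likelihood of the form exp(-beta * loss), and a self-normalized importance sampling estimate of how likely each unit is to be kept given good task performance. The toy network, the names `posterior_keep_probability`, `normalized_entropy`, and `toy_loss`, and the parameter `beta` are all hypothetical and chosen only for illustration.

```python
import numpy as np

def posterior_keep_probability(loss_fn, n_units, n_samples=4000, beta=5.0, seed=0):
    """Estimate P(unit kept | low task loss) by self-normalized importance sampling.

    Assumed setup (not from the paper): Bernoulli(0.5) prior over masks and a
    pseudo-likelihood proportional to exp(-beta * loss).
    """
    rng = np.random.default_rng(seed)
    # Sample masks from the prior: 1 keeps a unit, 0 ablates it.
    masks = rng.integers(0, 2, size=(n_samples, n_units)).astype(float)
    losses = np.array([loss_fn(m) for m in masks])
    # Subtract the minimum loss before exponentiating for numerical stability.
    weights = np.exp(-beta * (losses - losses.min()))
    weights /= weights.sum()
    # Weighted average of the masks gives the per-unit posterior keep probability.
    return weights @ masks

def normalized_entropy(scores):
    """Entropy of normalized non-negative scores, scaled to [0, 1]."""
    scores = np.asarray(scores, dtype=float)
    p = scores / scores.sum()
    p = p[p > 0]
    return abs(float(-(p * np.log(p)).sum() / np.log(len(scores))))

# Toy "network": 8 hidden units, but the readout only uses the first two.
rng = np.random.default_rng(1)
H = rng.normal(size=(200, 8))            # hidden activations for 200 inputs
w = np.array([2.0, -2.0, 0, 0, 0, 0, 0, 0])
y = H @ w                                 # targets produced by the intact network

def toy_loss(mask):
    # Mean squared error of the readout when masked-out units are zeroed.
    return float(np.mean(((H * mask) @ w - y) ** 2))

keep = posterior_keep_probability(toy_loss, n_units=8)
print(np.round(keep, 2))
```

In this toy setting, the two units the readout depends on end up with a keep probability near 1, while the unused units stay near the prior value of 0.5. The shift away from the prior is one simple way to read off a causal contribution. `normalized_entropy` illustrates how an information-theoretic summary of such scores could serve as a rough proxy for distributedness: it is 0 when one unit carries everything and 1 when all units contribute equally.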
