Feature Integration Spaces: Joint Training Reveals Dual Encoding in Neural Network Representations
Omar Claflin

TL;DR
This paper proposes that neural networks encode information in two complementary spaces, feature identity and feature integration, and shows that joint-training architectures capturing both improve reconstruction and reduce KL divergence errors.
Contribution
It proposes a novel dual encoding hypothesis and develops joint-training architectures that capture feature identity and integration simultaneously, advancing interpretability of neural representations.
Findings
Joint training improves reconstruction by 41.3% and reduces KL divergence errors by 51.6%
Integration features show sensitivity to experimental manipulations
Nonlinear components achieve 16.5% standalone improvements
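The summary does not say how these percentages are computed. One common reading, assumed here, is a relative reduction of an error metric against a baseline SAE, with behavioral error measured as the KL divergence between the model's output distribution on original activations and on reconstructed activations. The helper names below are hypothetical and only sketch that reading:

```python
import math

def relative_reduction(baseline, new):
    """Fractional reduction of an error metric relative to a baseline.

    Assumption: "X% improvement" means (baseline - new) / baseline.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - new) / baseline

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length.

    p: output distribution with original activations (assumed reference).
    q: output distribution with reconstructed activations.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing; q_i is floored at eps
    # to avoid division by zero.
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)
```

Under this reading, halving an error metric corresponds to `relative_reduction(2.0, 1.0) == 0.5`, that is, a 50% improvement.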
Abstract
Current sparse autoencoder (SAE) approaches to neural network interpretability assume that activations can be decomposed through linear superposition into sparse, interpretable features. Despite high reconstruction fidelity, SAEs consistently fail to eliminate polysemanticity and exhibit pathological behavioral errors. We propose that neural networks encode information in two complementary spaces compressed into the same substrate: feature identity and feature integration. To test this dual encoding hypothesis, we develop sequential and joint-training architectures to capture identity and integration patterns simultaneously. Joint training achieves a 41.3% reconstruction improvement and a 51.6% reduction in KL divergence errors. This architecture spontaneously develops bimodal feature organization: low-squared-norm features contributing to integration pathways and the rest contributing…
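As a rough illustration of what a joint-training architecture of this kind could look like, the sketch below pairs a standard sparse linear path (feature identity) with a small nonlinear path over the same feature codes (feature integration) and scores both with a single loss. This is not the paper's implementation: the layer shapes, the `tanh` nonlinearity, the L1 penalty, and every name (`init_params`, `forward`, `joint_loss`, `W_int1`, `W_int2`) are assumptions made for the example.

```python
import numpy as np

def init_params(d_model, n_features, d_hidden, seed=0):
    """Randomly initialise weights for both paths (hypothetical shapes)."""
    rng = np.random.default_rng(seed)
    scale = 0.1
    return {
        "W_enc": rng.normal(0, scale, (d_model, n_features)),
        "b_enc": np.zeros(n_features),
        "W_dec": rng.normal(0, scale, (n_features, d_model)),
        "W_int1": rng.normal(0, scale, (n_features, d_hidden)),
        "W_int2": rng.normal(0, scale, (d_hidden, d_model)),
    }

def forward(params, x):
    """Return sparse codes and the combined reconstruction of x."""
    # Identity path: ReLU codes decoded linearly, as in a standard SAE.
    f = np.maximum(x @ params["W_enc"] + params["b_enc"], 0.0)
    x_identity = f @ params["W_dec"]
    # Integration path (assumed form): a nonlinear function of the codes
    # that can model interactions between features.
    h = np.tanh(f @ params["W_int1"])
    x_integration = h @ params["W_int2"]
    return f, x_identity + x_integration

def joint_loss(params, x, l1_coeff=1e-3):
    """Single objective covering both paths: MSE plus L1 sparsity on codes."""
    f, x_hat = forward(params, x)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = l1_coeff * np.mean(np.sum(np.abs(f), axis=-1))
    return recon + sparsity
```

"Joint" here means that the gradients of one loss update both paths together. A sequential variant would first fit the linear path and then fit the nonlinear path to its residual.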
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
