
TL;DR
This paper presents a minimal task that isolates conditional learning in neural networks. Models first learn the marginal, producing a loss plateau, and then transition sharply to the full conditional; plateau duration is set by dataset size, and the timing of the transition depends on gradient noise.
Contribution
It introduces a minimal task for studying conditional learning and uses it to characterize how the marginal and then the conditional are acquired, how gradient noise stabilizes the marginal solution, and how dataset size sets the plateau duration.
Findings
Models first learn the marginal P(A|B), creating a plateau at log K.
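Gradient noise stabilizes the marginal solution: higher learning rates monotonically slow the transition (3.6× across a 7× η range at fixed throughput), and smaller batch sizes delay escape from the plateau.
A selector-routing head assembles during the plateau and leads the loss transition by ~50% of the waiting time.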
Abstract
We construct a minimal task that isolates conditional learning in neural networks: a surjective map with K-fold ambiguity, resolved by a selector token z, so H(A | B) = log K while H(A | B, z) = 0. The model learns the marginal P(A | B) first, producing a plateau at exactly log K, before acquiring the full conditional in a sharp, collective transition. The plateau has a clean decomposition: height = log K (set by ambiguity), duration = f(D) (set by dataset size D, not K). Gradient noise stabilizes the marginal solution: higher learning rates monotonically slow the transition (3.6× across a 7× η range at fixed throughput), and batch-size reduction delays escape, consistent with an entropic force opposing departure from the low-gradient marginal. Internally, a selector-routing head assembles during the plateau, leading the loss transition by ~50% of the waiting time. This is the Type…
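The plateau height stated in the abstract can be checked with a small sketch. The code below is not the authors' setup: the names make_task, marginal_predictor, conditional_predictor and mean_cross_entropy are hypothetical, and it assumes a uniform selector z and natural-log cross-entropy. It builds a K-to-1 map and compares a predictor that ignores z with one that uses it, so the first should score log K and the second 0.

```python
import math
import random


def make_task(num_b, k, seed=0):
    """Hypothetical K-to-1 map: every b gets k distinct candidate answers a.

    preimages[b][z] is the single answer selected by z among b's candidates,
    so a is ambiguous given b alone and fully determined given (b, z).
    """
    rng = random.Random(seed)
    a_ids = list(range(num_b * k))
    rng.shuffle(a_ids)
    return [a_ids[b * k:(b + 1) * k] for b in range(num_b)]


def marginal_predictor(preimages, b, z):
    # Ignores the selector z: uniform over the k candidates of b, i.e. P(A | B).
    k = len(preimages[b])
    return {a: 1.0 / k for a in preimages[b]}


def conditional_predictor(preimages, b, z):
    # Uses the selector z: all probability mass on the selected answer.
    return {preimages[b][z]: 1.0}


def mean_cross_entropy(preimages, predictor):
    # Average negative log-likelihood (in nats) over every (b, z) pair.
    total, count = 0.0, 0
    for b, candidates in enumerate(preimages):
        for z, a in enumerate(candidates):
            total += -math.log(predictor(preimages, b, z)[a])
            count += 1
    return total / count


K = 4  # assumed ambiguity for this illustration
preimages = make_task(num_b=8, k=K)
marginal_loss = mean_cross_entropy(preimages, marginal_predictor)
conditional_loss = mean_cross_entropy(preimages, conditional_predictor)
print(marginal_loss, math.log(K), conditional_loss)
```

This only illustrates the two loss levels between which the transition occurs; it says nothing about the training dynamics, plateau duration, or the internal routing head described above.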
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Reservoir Computing · Quantum many-body systems
