Marginals Before Conditionals

Mihir Sahasrabudhe

arXiv:2603.10074 · cs.LG · March 12, 2026

Open Access

TL;DR

This paper presents a minimal task that isolates conditional learning in neural networks. Models first learn the marginal P(A | B) and sit on a loss plateau at log K, then switch sharply to the full conditional. Plateau duration depends on dataset size, and gradient noise delays the transition.

Contribution

It introduces a minimal task for studying conditional learning and uses it to show that the marginal is acquired before the conditional, that plateau duration is set by dataset size, and that gradient noise stabilizes the marginal solution.

Findings

01

Models first learn the marginal P(A | B), producing a loss plateau at exactly log K.

02

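Gradient noise delays the transition: higher learning rates slow it monotonically (3.6× across a 7× range of η at fixed throughput), and smaller batch sizes delay escape from the plateau.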

03

A selector-routing head assembles during the plateau, leading the loss transition by ~50% of the waiting time.
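
The log K plateau height in finding 01 follows from a short calculation. The sketch below rests on assumptions not stated in this summary: the loss is cross-entropy measured in nats, and for each B the K candidate values of A are equally likely. The names q_marg and q_cond are labels introduced here for illustration only.

```latex
\begin{align*}
q_{\text{marg}}(a \mid b, z) &= P(a \mid b) = \tfrac{1}{K}
  \quad \text{for each of the $K$ candidates of } b, \\
\mathcal{L}(q_{\text{marg}}) &= \mathbb{E}\left[-\log q_{\text{marg}}(A \mid B, z)\right]
  = -\log \tfrac{1}{K} = \log K = H(A \mid B), \\
q_{\text{cond}}(a \mid b, z) &= P(a \mid b, z) \in \{0, 1\}, \\
\mathcal{L}(q_{\text{cond}}) &= \mathbb{E}\left[-\log q_{\text{cond}}(A \mid B, z)\right]
  = -\log 1 = 0 = H(A \mid B, z).
\end{align*}
```

Under these assumptions, a model that ignores z cannot do better than log K, while one that routes on z can reach zero loss. For K = 4 the plateau would sit at log 4 ≈ 1.386 nats.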

Abstract

We construct a minimal task that isolates conditional learning in neural networks: a surjective map with K-fold ambiguity, resolved by a selector token z, so H(A | B) = log K while H(A | B, z) = 0. The model learns the marginal P(A | B) first, producing a plateau at exactly log K, before acquiring the full conditional in a sharp, collective transition. The plateau has a clean decomposition: height = log K (set by ambiguity), duration = f(D) (set by dataset size D, not K). Gradient noise stabilizes the marginal solution: higher learning rates monotonically slow the transition (3.6× across a 7× η range at fixed throughput), and batch-size reduction delays escape, consistent with an entropic force opposing departure from the low-gradient marginal. Internally, a selector-routing head assembles during the plateau, leading the loss transition by ~50% of the waiting time. This is the Type…
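
The task structure described in the abstract can be illustrated with a small synthetic dataset. This is a minimal sketch and not the paper's actual construction: the function names `make_dataset` and `conditional_entropy`, the sizes `num_b` and `k`, and the use of a random assignment of answers are all assumptions made for illustration. It builds a K-to-one surjective map from answers A to keys B, with a selector z that picks out one answer, then checks the two entropies empirically.

```python
import math
import random
from collections import Counter, defaultdict


def make_dataset(num_b=50, k=4, seed=0):
    """Build (b, z, a) triples where each b has k candidate answers.

    Hypothetical construction: answer ids are randomly assigned, so the
    map a -> b is surjective and k-to-one, and (b, z) determines a.
    """
    rng = random.Random(seed)
    answer_ids = list(range(num_b * k))
    rng.shuffle(answer_ids)
    rows = []
    for b in range(num_b):
        for z in range(k):
            a = answer_ids[b * k + z]  # unique answer for each (b, z) pair
            rows.append((b, z, a))
    rng.shuffle(rows)
    return rows


def conditional_entropy(rows, context):
    """Empirical H(A | context(row)) in nats."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[context(row)][row[2]] += 1
    n = len(rows)
    h = 0.0
    for ctx_counts in counts.values():
        total = sum(ctx_counts.values())
        for c in ctx_counts.values():
            p = c / total
            h -= (total / n) * p * math.log(p)
    return h


rows = make_dataset(num_b=50, k=4)
h_marginal = conditional_entropy(rows, lambda r: r[0])        # H(A | B)
h_full = conditional_entropy(rows, lambda r: (r[0], r[1]))    # H(A | B, z)
print(f"H(A|B) = {h_marginal:.4f}, log K = {math.log(4):.4f}, H(A|B,z) = {h_full:.4f}")
```

In this toy version, H(A | B) matches log K and H(A | B, z) is zero, which is the gap a model has to close by learning to use the selector token.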

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Reservoir Computing · Quantum many-body systems