Swing-by Dynamics in Concept Learning and Compositional Generalization
Yongyi Yang, Core Francisco Park, Ekdeep Singh Lubana, Maya Okawa, Wei Hu, Hidenori Tanaka

TL;DR
This paper provides a theoretical analysis of how diffusion models learn compositional concepts, introducing a simplified structured identity mapping (SIM) task to explain empirical phenomena and predict learning dynamics.
Contribution
It introduces the SIM task as a theoretical abstraction to analyze concept learning dynamics and explains empirical observations in diffusion models through mathematical analysis.
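As a rough illustration of what a SIM-style dataset could look like, the sketch below samples Gaussian clusters whose centroids sit on the coordinate axes and builds a "composed" test point from all centroids. The function names, the choice of one cluster per dimension, and the shared per-coordinate noise scale `sigma` are assumptions made for illustration, not the paper's exact construction.

```python
import numpy as np


def make_sim_dataset(mu, sigma, n_per_cluster, seed=0):
    """Sample one Gaussian cluster per axis (hypothetical helper).

    Cluster p is centred at mu[p] * e_p, and every cluster uses the same
    per-coordinate standard deviation sigma (an assumption of this sketch).
    Returns an array of shape (len(mu) * n_per_cluster, len(mu)).
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    d = mu.shape[0]
    rng = np.random.default_rng(seed)
    centroids = np.diag(mu)  # row p equals mu[p] * e_p
    noise = rng.standard_normal((d, n_per_cluster, d)) * sigma
    samples = centroids[:, None, :] + noise  # shape (d, n_per_cluster, d)
    return samples.reshape(-1, d)


def composed_test_point(mu):
    """Test point combining every centroid: sum_p mu[p] * e_p (assumed form)."""
    return np.asarray(mu, dtype=float).copy()


x_train = make_sim_dataset(mu=[1.0, 2.0], sigma=[0.1, 0.1], n_per_cluster=500)
x_hat = composed_test_point([1.0, 2.0])
```

With small `sigma`, no training cluster is centred at `x_hat`, which is what makes it a candidate for testing compositional generalization; with large `sigma` the clusters overlap it, which is the concern raised in one of the reviews.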
Findings
Identifies non-monotonic test-loss behavior during the early phase of training.
Provides a theoretical framework that explains sequential and hierarchical generalization.
Validates predictions by training a real diffusion model on the SIM task.
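To make the kind of experiment behind these findings concrete, here is a minimal sketch that trains a symmetric two-layer linear model `f(x) = U Uᵀ x` on an identity-mapping objective with plain gradient descent and records the loss at a held-out composed point. The data layout, hyperparameters, and function name are assumptions chosen for illustration; this is not the paper's code, and whether the recorded test curve is non-monotonic depends on `mu`, `sigma`, and the initialization scale.

```python
import numpy as np


def train_symmetric_two_layer(x_train, x_test, steps=2000, lr=0.05,
                              init_scale=1e-2, seed=0):
    """Gradient descent on 0.5 * mean ||U U^T x - x||^2 (hypothetical helper).

    Returns (train_curve, test_curve), each of length `steps`, where the
    test curve is 0.5 * ||U U^T x_test - x_test||^2 at every step.
    """
    n, d = x_train.shape
    rng = np.random.default_rng(seed)
    u = init_scale * rng.standard_normal((d, d))  # small random init
    second_moment = x_train.T @ x_train / n       # A = E[x x^T]
    eye = np.eye(d)
    train_curve, test_curve = [], []
    for _ in range(steps):
        resid = u @ u.T - eye                     # R = W - I with W = U U^T
        train_curve.append(0.5 * np.trace(resid @ second_moment @ resid.T))
        test_curve.append(0.5 * np.sum((resid @ x_test) ** 2))
        # dL/dU = (R A + A R) U, using the symmetry of R and A
        grad_u = (resid @ second_moment + second_moment @ resid) @ u
        u = u - lr * grad_u
    return np.array(train_curve), np.array(test_curve)


# Assumed toy data: two axis-aligned Gaussian clusters in 2D.
rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0])
sigma = np.array([0.1, 0.1])
x_train = np.concatenate(
    [mu[p] * np.eye(2)[p] + sigma * rng.standard_normal((500, 2))
     for p in range(2)]
)
x_hat = mu.copy()  # composed point, not the centre of any training cluster

train_curve, test_curve = train_symmetric_two_layer(x_train, x_hat)
```

Plotting `test_curve` against the step index is one way to look for the early-training behavior described above; varying the entries of `mu` and `sigma` changes how quickly each direction is learned.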
Abstract
Prior work has shown that text-conditioned diffusion models can learn to identify and manipulate primitive concepts underlying a compositional data-generating process, enabling generalization to entirely novel, out-of-distribution compositions. Beyond performance evaluations, these studies develop a rich empirical phenomenology of learning dynamics, showing that models generalize sequentially, respecting the compositional hierarchy of the data-generating process. Moreover, concept-centric structures within the data significantly influence a model's speed of learning the ability to manipulate a concept. In this paper, we aim to better characterize these empirical results from a theoretical standpoint. Specifically, we propose an abstraction of prior work's compositional generalization problem by introducing a structured identity mapping (SIM) task, where a model is trained to learn the…
Peer Reviews
Decision: ICLR 2025 Poster
This work studies a relevant problem: the learning dynamics of compositional generalization in diffusion models are important for understanding how models can learn in a sample-efficient manner, how generalization can be achieved, and how training data should be curated, to name a few ways these insights could be impactful. The paper is easy to follow for the most part and builds on a prior line of work in this area. The non-monotonic training dynamics of a symmetric two-layer linear model are well expl
In summary, I find the paper misses the mark, as the SIM task is, as far as I understand, a poor setting to study compositional generalization, and the insights on this simple task translate poorly to the training of a diffusion model, even for the simple toy setting that is used (which itself is approximating the compositional generalization of text-conditioned diffusion models that this paper aims to study). I will elucidate the issues I see with the setting and results below. As it stands, I
1. The paper is clearly written, with a logical flow that makes each insight and conclusion easy to follow.
2. The simplicity of the problem setup enhances the clarity and robustness of both the empirical observations and the theoretical contributions.
3. The diffusion model results are compelling, mirroring the behavior observed in simpler settings and offering explanations for phenomena noted in prior work.
See the questions section for more information.
- The paper formalizes a proxy task (SIM) to model the learning dynamics of compositional generalization, and shows that real-world text-conditioned diffusion models exhibit similar behavior on a specific task.
- The theoretical setup is clearly explained and the theoretical conclusions are well elaborated. Limitations of the theoretical results on the one-layer and symmetric two-layer linear models are also clearly discussed.
- It is not at all apparent that $\hat{x}$ is "outside of the training distribution", since $p(x_k^{(p)} = \hat{x}) > 0$ for training data $x_k^{(p)}$, despite what is claimed on L174. Why not choose $\sigma$ such that $\sigma_p \geq 0$ while keeping $\sigma_{q \neq p} = 0$? Furthermore, components of $\sigma$ seem to be as large as $2$ in Fig. 1, with $\mu$ ranging from $0 - 2$. In such cases, the converse seems to hold -- that $x_k^{(p)}$ is very much in-distribution of the training data.
- As a
Taxonomy
Topics: Neural Networks and Applications · Fuzzy Logic and Control Systems
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion
