Swing-by Dynamics in Concept Learning and Compositional Generalization

Yongyi Yang; Core Francisco Park; Ekdeep Singh Lubana; Maya Okawa; Wei Hu; Hidenori Tanaka

arXiv:2410.08309·cs.LG·November 3, 2025

Open Access · 3 Reviews

TL;DR

This paper provides a theoretical analysis of how diffusion models learn compositional concepts, introducing a simplified structured identity mapping task to explain empirical phenomena and predict learning dynamics.

Contribution

The paper introduces the structured identity mapping (SIM) task as a theoretical abstraction for analyzing concept learning dynamics, and uses mathematical analysis of this task to explain empirical observations in diffusion models.

Findings

01. Identifies non-monotonic test loss behavior during the early phase of training.

02. Provides a theoretical framework that explains sequential and hierarchical generalization.

03. Validates predictions by training a real diffusion model on the SIM task.
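
As a rough illustration of the sequential learning noted in the findings, the following Python sketch trains a one-layer linear model f(x) = Wx on an identity-mapping objective and tracks its output at a test point that combines all concepts. It is a minimal sketch under stated assumptions, not the authors' implementation: the cluster layout (Gaussian clusters centred at mu[p] along axis p, sharing diagonal noise sigma), the parameter values, the zero initialization, and the helper names sim_second_moment and train_linear are all hypothetical choices made for illustration.

```python
import numpy as np


def sim_second_moment(mu, sigma):
    """Population second moment E[x x^T] for an assumed SIM-style dataset.

    Assumption: cluster p is Gaussian with mean mu[p] * e_p and covariance
    diag(sigma**2), one cluster per coordinate, sampled with equal probability.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    s = len(mu)
    # Averaging diag(sigma^2) + mu_p^2 e_p e_p^T over the s clusters.
    return np.diag(sigma**2 + mu**2 / s)


def train_linear(A, steps, lr):
    """Gradient descent on E||W x - x||^2 for f(x) = W x, starting at W = 0."""
    d = A.shape[0]
    W = np.zeros((d, d))
    history = [W.copy()]
    for _ in range(steps):
        grad = 2.0 * (W - np.eye(d)) @ A  # gradient of the expected squared error
        W = W - lr * grad
        history.append(W.copy())
    return history


mu = np.array([2.0, 1.0])     # hypothetical signal strengths per concept
sigma = np.array([0.5, 0.5])  # hypothetical per-coordinate noise scales
A = sim_second_moment(mu, sigma)
history = train_linear(A, steps=200, lr=0.05)

x_hat = mu.copy()  # assumed test point: every concept present at once
outputs = np.array([W @ x_hat for W in history])
progress = outputs / x_hat  # fraction of each target coordinate recovered

for k in (0, 5, 10, 50, 200):
    print(k, np.round(progress[k], 3))
```

In this toy setting the coordinate with the larger second moment (sigma^2 + mu^2 / s) approaches its target first, which is one simple way the order of learning can depend on data structure. The sketch does not reproduce the non-monotonic test loss, which Reviewer 01 associates with the symmetric two-layer linear model.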

Abstract

Prior work has shown that text-conditioned diffusion models can learn to identify and manipulate primitive concepts underlying a compositional data-generating process, enabling generalization to entirely novel, out-of-distribution compositions. Beyond performance evaluations, these studies develop a rich empirical phenomenology of learning dynamics, showing that models generalize sequentially, respecting the compositional hierarchy of the data-generating process. Moreover, concept-centric structures within the data significantly influence a model's speed of learning the ability to manipulate a concept. In this paper, we aim to better characterize these empirical results from a theoretical standpoint. Specifically, we propose an abstraction of prior work's compositional generalization problem by introducing a structured identity mapping (SIM) task, where a model is trained to learn the…

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

This work studies a relevant problem: The learning dynamics of compositional generalization in diffusion models are important to understand how models can learn in a sample-efficient manner, how generalization can be achieved, or how training data should be curated, to name a few ways insights could be impactful. The paper is easy to follow for the most part and builds on a prior line of work in this area. The non-monotonic training dynamics of a symmetric two-layer linear model are well expl

Weaknesses

In summary, I find the paper misses the mark, as the SIM task is, as far as I understand, a poor setting to study compositional generalization, and the insights on this simple task translate poorly to the training of a diffusion model, even for the simple toy setting that is used (which itself is approximating the compositional generalization of text-conditioned diffusion models that this paper aims to study). I will elucidate the issues I see with the setting and results below. As it stands, I

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The paper is clearly written, with a logical flow that makes each insight and conclusion easy to follow.
2. The simplicity of the problem setup enhances the clarity and robustness of both the empirical observations and the theoretical contributions.
3. The diffusion model results are compelling, mirroring the behavior observed in simpler settings and offering explanations for phenomena noted in prior work.

Weaknesses

See the questions section for more information.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- Paper formalizes a proxy task (SIM) to model the learning dynamics of compositional generalization, and shows that real-world text-conditioned diffusion models exhibit similar behavior on a specific task.
- Theoretical setup is clearly explained and theoretical conclusions are also well-elaborated. Limitations of theoretical results on the one-layer and symmetric two-layer linear models are also clearly discussed.

Weaknesses

- It is not at all apparent that $\hat{x}$ is "outside of the training distribution", since $p(x_k^{(p)} = \hat{x}) > 0$ for training data $x_k^{(p)}$, despite what is claimed on L174. Why not choose sigma such that $\sigma_p \geq 0$ while keeping $\sigma_{q \neq p} = 0$? Furthermore, components of $\sigma$ seem to be as large as $2$ in Fig. 1, with $\mu$ ranging from $0$ to $2$. In such cases, the converse seems to hold -- that $x_k^{(p)}$ is very much in-distribution of the training data.
- As a

Taxonomy

Topics: Neural Networks and Applications · Fuzzy Logic and Control Systems

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion