TL;DR
This paper systematically studies how joint multi-task training can significantly reduce model size requirements by leveraging task compatibility and shared structures, with implications for efficient curriculum design.
Contribution
It provides a systematic analysis of how task pairing influences model capacity reduction in multi-task learning, highlighting the importance of task compatibility and shared primitives.
Findings
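Pairing easy operations (MAX, MIN, PROD) with hard ones (modular addition, permutation products) reduces the number of parameters needed to learn the task by 2-7 times.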
Successful joint training induces structured representations in models.
Pretraining on easy tasks enables learning complex tasks at smaller sizes.
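The "2-7 times" figure compares the minimum model size at which a task is learned with and without joint training. As a minimal sketch of that comparison, the snippet below picks the smallest model in a size sweep whose accuracy reaches a threshold and takes the ratio of the two transition sizes. The function name `learning_transition`, the 0.9 threshold, and all numbers are assumptions made up for illustration; they are not the paper's definition or its results.

```python
from typing import Optional, Sequence, Tuple


def learning_transition(
    sweep: Sequence[Tuple[int, float]], threshold: float = 0.9
) -> Optional[int]:
    """Return the smallest parameter count whose accuracy reaches the threshold.

    `sweep` is a list of (parameter_count, accuracy) pairs. The threshold is an
    assumed cutoff for "learned", not a value taken from the paper.
    """
    learned = [size for size, acc in sweep if acc >= threshold]
    return min(learned) if learned else None


# Made-up sweeps for illustration only; these are not results from the paper.
single_task = [(10_000, 0.11), (50_000, 0.14), (200_000, 0.35), (700_000, 0.97)]
joint_task = [(10_000, 0.12), (50_000, 0.20), (200_000, 0.96), (700_000, 0.99)]

n_single = learning_transition(single_task)
n_joint = learning_transition(joint_task)
reduction = n_single / n_joint  # how many times smaller the joint model can be
print(n_single, n_joint, reduction)
```

If no model in the sweep reaches the threshold, the function returns `None`, which signals that the transition lies above the largest size tested.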
Abstract
Multi-task learning improves generalization, but when does it reduce the model capacity required to learn? We provide a systematic study of how joint training affects the learning transition, the minimum model size at which a task can be learned, using nested arithmetic (ListOps) and permutation groups as controlled testbeds. Certain task pairings dramatically reduce model size requirements: combining easy operations (MAX, MIN, PROD) with hard ones (modular addition, permutation products) enables learning with 2-7 times fewer parameters. Crucially, we also identify when synergies fail: pairing structurally similar hard tasks (e.g., ADD with alternating-sign NADD) provides no benefit, nor does pairing tasks lacking shared computational primitives. PCA of learned embeddings reveals that successful joint training induces structured number representations (ordering, parity, modular…
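To make the ListOps testbed concrete, here is a minimal sketch of nested arithmetic expressions over MAX, MIN, PROD, and modular addition. The tuple representation, the bracketed string format, the modulus of 10, the choice to reduce PROD modulo the same number, and the generator `random_expr` are all assumptions for illustration; the paper's exact data format is not shown in this summary.

```python
import math
import random

MODULUS = 10  # assumed modulus; the paper studies several moduli

# Each operation maps a list of integers in [0, MODULUS) to one such integer.
OPS = {
    "MAX": max,
    "MIN": min,
    "ADD": lambda xs: sum(xs) % MODULUS,         # modular addition
    "PROD": lambda xs: math.prod(xs) % MODULUS,  # assumed to be modular too
}


def evaluate(expr):
    """Evaluate an expression: an int leaf or a (op_name, [sub-expressions]) tuple."""
    if isinstance(expr, int):
        return expr
    op, args = expr
    return OPS[op]([evaluate(a) for a in args])


def to_string(expr):
    """Render an expression in a bracketed prefix form, e.g. [MAX 3 [ADD 7 8]]."""
    if isinstance(expr, int):
        return str(expr)
    op, args = expr
    return "[" + op + " " + " ".join(to_string(a) for a in args) + "]"


def random_expr(rng, ops, depth, max_args=3):
    """Sample a random nested expression using only the operations in `ops`."""
    if depth == 0 or rng.random() < 0.3:
        return rng.randrange(MODULUS)
    op = rng.choice(ops)
    n_args = rng.randint(2, max_args)
    return (op, [random_expr(rng, ops, depth - 1, max_args) for _ in range(n_args)])


example = ("MAX", [3, ("ADD", [7, 8]), ("PROD", [4, 6])])
print(to_string(example), "=", evaluate(example))
```

Under these assumptions, a joint-training mix would sample from a combined operation list such as `["MAX", "ADD"]`, while a single-task baseline would pass `["ADD"]` alone.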
Peer Reviews
Decision · Submitted to ICLR 2026
The paper presents several interesting phenomena:
- The embeddings of jointly trained models separate even and odd numbers on ADD and PROD tasks.
- The experiment on shuffled SUM suggests that the benefit of joint training emerges only when the easy and hard tasks share underlying numerical properties (Figure 3).
- The transfer learning experiment demonstrates that pretraining on simpler tasks and then transferring to harder ones is an effective curriculum (Figure 5).
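The even/odd separation in the embeddings can be checked by projecting number embeddings onto their first principal component. The sketch below does this on synthetic embeddings built for illustration, not on the paper's trained models. The helper `first_principal_component`, the `parity_axis` direction, the noise level, and the embedding dimension are all assumptions. It uses only the standard library (power iteration on the covariance matrix); an SVD from a numerical library would be the more usual choice.

```python
import math
import random


def first_principal_component(rows, iters=200):
    """Return (centered rows, top principal direction) using power iteration."""
    n, dim = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - mean[j] for j in range(dim)] for r in rows]
    # Sample covariance matrix of the centered embeddings.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return centered, v


# Synthetic embeddings for the numbers 0..9: a parity signal along one
# direction plus small Gaussian noise. Purely illustrative data.
rng = random.Random(0)
parity_axis = [0.6, -0.8, 0.0, 0.0]
embeddings = []
for number in range(10):
    sign = 1.0 if number % 2 == 0 else -1.0
    embeddings.append([sign * a + rng.gauss(0.0, 0.1) for a in parity_axis])

centered, pc1 = first_principal_component(embeddings)
scores = [sum(c * p for c, p in zip(row, pc1)) for row in centered]
even_scores, odd_scores = scores[0::2], scores[1::2]
# The sign of a principal component is arbitrary, so test both orderings.
separated = max(even_scores) < min(odd_scores) or max(odd_scores) < min(even_scores)
print("parity separated along PC1:", separated)
```

With real embeddings, `embeddings` would be replaced by the rows of the trained model's token-embedding matrix for the number tokens.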
- The novelty and significance of the work may be limited.
- The general observation that joint training or curriculum learning benefits language models, including for arithmetic tasks, has been reported previously. The most related prior work I know is [1]; it would be helpful for the authors to clarify how their approach and findings differ from that work.
- While the finding that joint training leads to more structured embeddings is interesting, the paper does not analyze how these embedd
Quality and Originality: The paper presents a systematic empirical exploration of how joint task training impacts the learning thresholds of small transformer models and attempts to reframe the understanding of scaling laws.
1. Lack of novelty. The results are not surprising, and the authors need to discuss the multi-task learning literature (that joint training can improve performance is common wisdom) and the curriculum learning literature. The introduction discusses scaling laws, but the authors do not provide an exact formulation of laws that include curriculum learning factors, nor a precise alternative to KC complexity.
2. Small-scale experiments do not support the motivation and hurt significance: Authors only conduct ex
1) The paper studies the effect of joint task learning through multiple experiments that cover many different aspects of the learning problem (e.g., prime vs. non-prime moduli, shuffling of order for ADD).
2) The interpretability analysis of the embedding vectors was useful. In particular, the restricted embedding hypothesis was strong evidence for the claims made about the utility of joint training.
3) Experiments on permutation groups present an interesting addition, with a lot of scope for futu
1) There is some literature on compositional and/or multi-step mathematical reasoning, and the paper is missing references to it [1, 2, 3]. Although the current paper has several important experiments that were missing or not considered in the papers mentioned below, it would be useful for the authors to devote some space to discussing these differences in the main text.
2) The last section, 'Discussion and Limitations', doesn't really discuss limitations. For example,
Taxonomy
Methods: Softmax · Attention Is All You Need
