From Low Intrinsic Dimensionality to Non-Vacuous Generalization Bounds in Deep Multi-Task Learning
Hossein Zakerinia, Dorsa Ghobadi, Christoph H. Lampert

TL;DR
This paper demonstrates that deep multi-task learning models operate in a low-dimensional space, enabling the derivation of non-vacuous generalization bounds and showing high accuracy with fewer parameters than single-task models.
Contribution
It introduces a low-dimensional parametrization method for multi-task networks and derives the first non-vacuous generalization bounds for such models.
Findings
Multi-task models have smaller intrinsic dimensionality than single-task models.
High-accuracy multi-task solutions can be achieved with fewer parameters.
The approach yields the first non-vacuous generalization bounds for deep multi-task networks.
Abstract
Deep learning methods are known to generalize well from training to future data, even in an overparametrized regime, where they could easily overfit. One explanation for this phenomenon is that even when their *ambient dimensionality* (i.e., the number of parameters) is large, the models' *intrinsic dimensionality* is small; specifically, their learning takes place in a small subspace of all possible weight configurations. In this work, we confirm this phenomenon in the setting of *deep multi-task learning*. We introduce a method to parametrize multi-task networks directly in the low-dimensional space, facilitated by the use of *random expansion* techniques. We then show that high-accuracy multi-task solutions can be found with much smaller intrinsic dimensionality (fewer free parameters) than what single-task learning requires. Subsequently, we show that the low-dimensional…
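The idea of training in a low-dimensional space through a fixed random expansion can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the shapes, and the particular way the shared and task-specific parts are combined (a shared vector `v` expanded into `k` basis directions, plus `k` coefficients per task) are assumptions made only for illustration.

```python
import numpy as np


def random_expansion(ambient_dim, intrinsic_dim, seed):
    """Return a fixed (non-trainable) random matrix with unit-norm columns."""
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((ambient_dim, intrinsic_dim))
    return P / np.linalg.norm(P, axis=0, keepdims=True)


def single_task_weights(w0, P, theta):
    """Single-task case: w = w0 + P @ theta, with only theta trainable."""
    return w0 + P @ theta


def multi_task_weights(w0, P_shared, v, alphas):
    """Hypothetical multi-task variant (an assumption, not the paper's code).

    P_shared: fixed random matrix of shape (D * k, l)
    v:        shared trainable vector of length l
    alphas:   per-task trainable coefficients, shape (n_tasks, k)
    Returns an array of shape (n_tasks, D), one weight vector per task.
    """
    D = w0.shape[0]
    k = alphas.shape[1]
    Q = (P_shared @ v).reshape(D, k)  # shared basis, expanded from v
    return w0[None, :] + alphas @ Q.T


def amortized_dim(l, k, n_tasks):
    """Free parameters per task: the shared part split across tasks, plus k."""
    return l / n_tasks + k


if __name__ == "__main__":
    D, l, k, n_tasks = 1000, 50, 5, 10  # illustrative sizes only
    rng = np.random.default_rng(0)
    w0 = rng.standard_normal(D)
    P_shared = random_expansion(D * k, l, seed=1)
    v = rng.standard_normal(l)
    alphas = rng.standard_normal((n_tasks, k))
    W = multi_task_weights(w0, P_shared, v, alphas)
    print(W.shape, amortized_dim(l, k, n_tasks))
```

With these illustrative sizes, the multi-task sketch trains `l + n_tasks * k = 100` numbers in total, while each task's weight vector still lives in the full `D`-dimensional ambient space.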
Peer Reviews
Decision·Submitted to ICLR 2026
* The paper is clearly written and well organized.
* The proposed approach is conceptually sound and well motivated.
* The computed PAC-Bayesian bounds are non-vacuous, which is a meaningful result for multi-task learning.

* The technical novelty appears limited.
* The work largely builds upon the compression-based PAC-Bayesian framework of Lotfi et al. (2022) and extends it in a relatively straightforward manner.
* The benefit of the proposed hierarchical parameter-sharing approach over existing methods is not clearly demonstrated.
* It remains unclear how the method compares to simpler parameter-sharing schemes, such as decomposing the parameter (w) into shared and task-specific components as in Li et al. (2018).
- The authors clearly demonstrate that the multi-task setting results in higher compressibility and better bounds "in their particular setting". The rigorous formulation makes the multi-task efficiency measurable and comparable to the single-task setting.
- The amortized intrinsic dimensionality provides a simple definition of 'how much sharing happens' and can be evaluated directly from experiments.
- There is a clear scaling behavior as the number of tasks grows, where the amortized dimension

- The paper’s contribution is mostly conceptual and empirical, as it does not show theoretically that if the tasks are related, e.g., as measured by some form of distributional distance, the bounds will be tighter than in the single-task setting. It would be valuable to provide a theoretical sufficient condition: if the tasks are related under a given metric, the bounds provably improve.
- There is no analysis to determine a priori whether the tasks are related or to identify

- The discussion seems to address natural questions that one would have when reading the results (i.e., what happens in corner cases when the tasks are very related or completely different), which I appreciate.
- Experiments in Table 1 seem interesting and support their hypothesis.
- The statements are close to standard compression arguments ("if you have an encoder, then the total complexity is the size of the encoder + that of the encoding of the joint tasks") and seem somewhat straightforward given Theorem 1 from Shalev-Shwartz & Ben-David. The authors argue in lines 399-403 that the advantage is that the bound depends only on the encoding of all tasks together and not on the sum of the individual encodings. But in all the examples given (e.g., line 334 or line 350), i
Taxonomy
Topics: Machine Learning and Data Classification
