From Low Intrinsic Dimensionality to Non-Vacuous Generalization Bounds in Deep Multi-Task Learning

Hossein Zakerinia; Dorsa Ghobadi; Christoph H. Lampert

arXiv:2501.19067 · cs.LG · May 22, 2025

TL;DR

This paper demonstrates that deep multi-task learning models operate in a low-dimensional space, enabling the derivation of non-vacuous generalization bounds and showing high accuracy with fewer parameters than single-task models.

Contribution

It introduces a low-dimensional parametrization method for multi-task networks and derives the first non-vacuous generalization bounds for such models.

Findings

1. Multi-task models have smaller intrinsic dimensionality than single-task models.
2. High-accuracy multi-task solutions can be found with fewer free parameters than single-task learning requires.
3. The approach yields the first non-vacuous generalization bounds for deep multi-task networks.

Abstract

Deep learning methods are known to generalize well from training to future data, even in an overparametrized regime, where they could easily overfit. One explanation for this phenomenon is that even when their *ambient dimensionality* (i.e., the number of parameters) is large, the models' *intrinsic dimensionality* is small; specifically, their learning takes place in a small subspace of all possible weight configurations. In this work, we confirm this phenomenon in the setting of *deep multi-task learning*. We introduce a method to parametrize multi-task networks directly in the low-dimensional space, facilitated by the use of *random expansion* techniques. We then show that high-accuracy multi-task solutions can be found with much smaller intrinsic dimensionality (fewer free parameters) than what single-task learning requires. Subsequently, we show that the low-dimensional…
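
The idea of learning inside a small random subspace can be illustrated with a minimal sketch. This is a generic single-model random-subspace parametrization, not the paper's multi-task construction: the full weight vector is written as `theta0 + P @ z`, where `P` is a fixed random expansion matrix and only the low-dimensional vector `z` is trained. The function names (`make_expansion`, `expand`, `fit_low_dim`), the toy linear model, and the dimensions are all assumptions chosen for illustration.

```python
import numpy as np

def make_expansion(ambient_dim, intrinsic_dim, seed=0):
    """Fixed random matrix mapping intrinsic coordinates to full weights."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((ambient_dim, intrinsic_dim)) / np.sqrt(intrinsic_dim)

def expand(theta0, P, z):
    """Full weight vector reached from the low-dimensional coordinates z."""
    return theta0 + P @ z

def fit_low_dim(X, y, theta0, P):
    """Train only z for the toy linear model y ~ X @ (theta0 + P @ z).

    The model is linear in z, so least squares stands in for gradient descent.
    """
    A = X @ P               # design matrix seen by the intrinsic coordinates
    b = y - X @ theta0      # residual left after the fixed starting point
    z, *_ = np.linalg.lstsq(A, b, rcond=None)
    return z

# Toy setup (hypothetical sizes): 1000 ambient weights, 10 free parameters.
rng = np.random.default_rng(1)
D, d, m = 1000, 10, 200
P = make_expansion(D, d)
theta0 = np.zeros(D)

# Synthetic data whose true weights lie in the random subspace by construction.
z_true = rng.standard_normal(d)
X = rng.standard_normal((m, D))
y = X @ expand(theta0, P, z_true)

z_hat = fit_low_dim(X, y, theta0, P)
theta_hat = expand(theta0, P, z_hat)
train_mse = float(np.mean((X @ theta_hat - y) ** 2))
print(f"ambient dim = {D}, free parameters = {z_hat.size}, train MSE = {train_mse:.2e}")
```

In this toy case the data is generated inside the subspace, so `d = 10` free parameters suffice by construction. For a real network one would instead increase `d` until a target accuracy is reached and report that value as the measured intrinsic dimensionality.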

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

* The paper is clearly written and well organized.
* The proposed approach is conceptually sound and well motivated.
* The computed PAC-Bayesian bounds are non-vacuous, which is a meaningful result for multi-task learning.

Weaknesses

* The technical novelty appears limited.
* The work largely builds upon the compression-based PAC-Bayesian framework of Lotfi et al. (2022) and extends it in a relatively straightforward manner.
* The benefit of the proposed hierarchical parameter-sharing approach over existing methods is not clearly demonstrated.
* It remains unclear how the method compares to simpler parameter-sharing schemes, such as decomposing the parameter (w) into shared and task-specific components as in Li et al. (2018)

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- The authors clearly demonstrate that the multi-task setting results in higher compressibility and better bounds "in their particular setting". The rigorous formulation makes the multi-task efficiency measurable and comparable to the single-task setting.
- The amortized intrinsic dimensionality provides a simple definition of 'how much sharing happens' and can be evaluated directly from experiments.
- There is a clear scaling behavior as the number of tasks grows, where the amortized dimension

Weaknesses

- The paper’s contribution is mostly conceptual and empirical, as it does not show theoretically that if the tasks are related, e.g., as measured by some form of distributional distance, the bounds will be tighter than in the single-task setting. It would be great to provide a sufficient condition with a theoretical guarantee: if the tasks are related under a certain metric, the bounds provably improve.
- There is no analysis to determine a priori whether the tasks are related or to identify

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The discussion seems to address natural questions that one would have when reading the results (i.e., what happens in corner cases when the tasks are very related or completely different), which I appreciate.
- Experiments in Table 1 seem interesting and support their hypothesis.

Weaknesses

- The statements are close to standard compression arguments ("if you have an encoder, then the total complexity is the size of the encoder + that of the encoding of the joint tasks") and seem somewhat straightforward given Theorem 1 from Shalev-Shwartz & Ben-David. The authors argue in lines 399-403 that the advantage is that the bound depends only on the joint encoding of all tasks and not on the sum of the individual encodings. But in all the examples given (e.g., line 334 or line 350), i

Code & Models

Repositories

hzakerinia/mtl · PyTorch · Official

Taxonomy

Topics: Machine Learning and Data Classification