Deep Regression Representation Learning with Topology

Shihao Zhang; Kenji Kawaguchi; Angela Yao

arXiv:2404.13904 · cs.LG · May 17, 2024

TL;DR

This paper explores how the topology of regression representations affects their effectiveness, proposing a new regularizer, PH-Reg, that aligns feature space topology with the target to improve regression performance.

Contribution

It introduces PH-Reg, a novel regularizer that matches the intrinsic dimension and topology of feature space with the target, enhancing regression representation learning.

Findings

01

PH-Reg improves regression performance on synthetic and real-world tasks.

02

Lower intrinsic dimension of features correlates with reduced complexity and better generalization.

03

Topologically similar feature and target spaces enhance representation effectiveness.

Abstract

Most works studying representation learning focus only on classification and neglect regression. Yet, the learning objectives and, therefore, the representation topologies of the two tasks are fundamentally different: classification targets class separation, leading to disconnected representations, whereas regression requires ordinality with respect to the target, leading to continuous representations. We thus wonder how the effectiveness of a regression representation is influenced by its topology, with evaluation based on the Information Bottleneck (IB) principle. The IB principle is an important framework that provides principles for learning effective representations. We establish two connections between it and the topology of regression representations. The first connection reveals that a lower intrinsic dimension of the feature space implies a reduced complexity of the…
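The abstract refers to the intrinsic dimension of the feature space. As a rough illustration only, and not the paper's PH-Reg regularizer, the sketch below estimates intrinsic dimension from how the total minimum-spanning-tree (MST) length of a point cloud grows with sample size. It relies on two standard facts: the total 0-dimensional persistence of a Vietoris-Rips filtration equals the MST length, and for n points sampled from a d-dimensional set the MST length grows roughly like n^((d-1)/d). The function names `mst_length` and `estimate_intrinsic_dimension`, the sample sizes, and the least-squares fit are assumptions made for this example.

```python
import numpy as np


def mst_length(points):
    """Total Euclidean MST edge length, via Prim's algorithm in O(n^2)."""
    n = len(points)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    # best[i] = distance from point i to the closest point already in the tree
    best = np.linalg.norm(points - points[0], axis=1)
    total = 0.0
    for _ in range(n - 1):
        candidates = np.where(in_tree, np.inf, best)
        j = int(np.argmin(candidates))
        total += candidates[j]
        in_tree[j] = True
        best = np.minimum(best, np.linalg.norm(points - points[j], axis=1))
    return total


def estimate_intrinsic_dimension(points, sizes=(100, 200, 400, 800, 1600), seed=0):
    """Fit log(MST length) against log(n) on nested subsamples.

    If E_n ~ n^((d-1)/d), the fitted slope s satisfies s = 1 - 1/d,
    so d = 1 / (1 - s). This is a hypothetical helper, not the paper's code.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(points))
    log_n = np.log(sizes)
    log_e = np.log([mst_length(points[order[:n]]) for n in sizes])
    slope = np.polyfit(log_n, log_e, 1)[0]
    return 1.0 / (1.0 - slope)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Orthonormal basis of a random 2-D plane inside a 10-D ambient space.
    basis, _ = np.linalg.qr(rng.normal(size=(10, 2)))
    plane = rng.uniform(size=(1600, 2)) @ basis.T
    print("estimated dimension:", estimate_intrinsic_dimension(plane))
```

On points drawn from a 2-D plane embedded in 10 dimensions, the estimate should land near 2 rather than 10, which is the sense in which a representation can have a low intrinsic dimension despite a high ambient one. The estimate is noisy at small sample sizes, so it is best read as a diagnostic rather than an exact value.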

Peer Reviews

Decision · ICML 2024 Poster

Reviewer 01 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

The paper tries to relate quantitative characteristics of data representations from information theory with topological data characteristics, following several recent approaches.

Weaknesses

1) The proposed "intrinsic dimension lowering" loss term $\mathcal{L}_d$ actually disturbs the intrinsic dimension in an unclear way: it can both increase and decrease it, since, for a given data representation $Z$, the term $\mathcal{L}_d$ involves the ratio of logarithms $\log E(Z_n)/ \log E(Y_n)$, where $Y$ is another data representation. There is no $Y$-dependent part, $\log E(Y_n)$, in the formula for the intrinsic dimension of $Z$; see e.g. arXiv:2306.04723 or J. M. Steele, Growth rates of eu

Reviewer 02 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

1. The IB principle for regression problems has received less attention. The motivation of this work is promising. 2. The new generalization error bound in Eq. (2) is interesting, although it still has some issues (see below).

Weaknesses

Overall, I have some concerns regarding Theorems 1, 2 and 3, especially Theorem 2. 1. I have several concerns regarding the new generalization error bound in Eq. (2). 1.1 How is this bound connected to the new bound in [1], which emphasized the role of I(X;Z|Y) in classification tasks? [1] Kawaguchi, Kenji, et al. "How Does Information Bottleneck Help Deep Learning?" arXiv preprint arXiv:2305.18887 (2023). 1.2 Is the bound tighter or not? Or is H(Z|Y) a good indicator on the generalizati

Reviewer 03 · Rating 3 (reject, not good enough) · Confidence 3

Strengths

1. Only a few papers study application of topology to machine learning and only a few papers study differentiable topological losses. 2. The paper is mostly well written and clear. 3. New theoretical results are presented. 4. Experiments show improvements in quality measures for super-resolution, depth estimation and age prediction problems. Ablation studies are provided.

Weaknesses

1. I have concerns about the IB principle. Neural networks are deterministic functions, and thus H(Y|Z) = 0 always. H(Z|Y) > 0 when different Z map to the same Y. This can happen when different X map to the same Y, because for real large networks different input objects X have different embeddings Z. So, some H(Z|Y)>0 depends only on the dataset itself, not the network for realistic scenarios. 2. $\min_Z \left( I(Z,X) - \beta I(Z,Y) \right)$ and $\min_Z \frac{I(Z,X)}{ \beta I(Z,Y) }$ are diffe

Reviewer 04 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

- **Advancing the science of deep learning for regression:** The fact that so much research into the science of deep learning has focused exclusively on classification at the expense of other tasks (such as regression) is a weakness of the field. This work, which investigates IB, intrinsic dimension, and the topology of latent space specifically for models trained to perform regression, thus represents a welcome research direction.
- **Utilizing interesting mathematics:** The paper brings togeth

Weaknesses

**Writing correctness and clarity:** There were a number of issues with the writing that made it more challenging to read the work than it should have been. For instance, typos such as:
1. In the introduction, “The homeomorphic between two…” $\mapsto$ “The homeomorphism between two…”.
2. In the introduction, “…and in the topology view,…” $\mapsto$ “…and from the topological viewpoint,…”
3. In Section 5.1, “In contrast, naively lowering the intrinsic dimension ($+L'_{d}$) performs poorly and eve

Reviewer 05 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

1) The paper draws interesting and novel connections between the Information Bottleneck principle and the topology and intrinsic dimensions of the representations with respect to the targets.
2) The experimental section covers a significant number of diverse datasets, integrating quantitative results with qualitative visualizations.
3) The authors include an estimation of the additional computational and memory cost of PH-Reg, demonstrating that the overhead of the proposed method is small comp

Weaknesses

## Main Concerns

1) **Clarity**
   1) The underlying assumptions used to prove the statements in Section 3 are not fully clarified. Theorem 1 implicitly restricts the considered representations to deterministic functions of $\mathbf{x}$. Previous literature [1] has shown the benefit of using stochastic encoders, for which the result from Theorem 1 is not applicable.
   2) Theorem 3 assumes a uniform (conditional) distribution on the manifold $\mathcal{M}_i$ but the paper does not elaborate on unde

Code & Models

Repositories

needylove/ph-reg
PyTorch · Official

Taxonomy

Topics: Machine Learning and Data Classification · Generative Adversarial Networks and Image Synthesis · Face and Expression Recognition

Methods: ALIGN · Focus