A Differentiable Rank-Based Objective For Better Feature Learning

Krunoslav Lehman Pavasovic; David Lopez-Paz; Giulio Biroli; Levent Sagun

arXiv:2502.09445·stat.ML·February 14, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces difFOCI, a differentiable approximation of a non-parametric dependence measure, enabling improved feature selection, neural network regularization, and fairness in classification tasks.

Contribution

We develop difFOCI, a differentiable, parametric version of FOCI, allowing broader application in feature learning, neural network training, and fairness without sensitive data.

Findings

01 · difFOCI outperforms FOCI in variable selection tasks

02 · It enhances feature learning and reduces spurious correlations

03 · It can be integrated into neural networks for improved performance

Abstract

In this paper, we leverage existing statistical methods to better understand feature learning from data. We tackle this by modifying the model-free variable selection method Feature Ordering by Conditional Independence (FOCI), introduced by Azadkia and Chatterjee (2021). While FOCI is based on a non-parametric coefficient of conditional dependence, we introduce its parametric, differentiable approximation. With this approximate coefficient of correlation, we present a new algorithm called difFOCI, which is applicable to a wider range of machine learning problems thanks to its differentiable nature and learnable parameters. We present difFOCI in three contexts: (1) as a variable selection method with baseline comparisons to FOCI, (2) as a trainable model parametrized with a neural network, and (3) as a generic, widely applicable neural network regularizer, one that improves feature…
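To make the idea of a "differentiable approximation" concrete, the following is a minimal NumPy sketch, not the authors' implementation. `t_hard` computes the unconditional rank-based dependence coefficient of Azadkia and Chatterjee (2021), which relies on a hard nearest-neighbour lookup. `t_soft` is an illustrative relaxation that replaces that lookup with softmax weights over negative squared distances; the function names, the inverse temperature `beta`, and the exact form of the relaxation are assumptions made for this example and may differ from the paper's formulation.

```python
import numpy as np

def _ranks(y):
    # r[i] = #{j : y[j] <= y[i]},  l[i] = #{j : y[j] >= y[i]}
    r = (y[None, :] <= y[:, None]).sum(axis=1).astype(float)
    l = (y[None, :] >= y[:, None]).sum(axis=1).astype(float)
    return r, l

def _sq_dists(z):
    # Pairwise squared Euclidean distances; diagonal set to +inf
    # so that a point is never its own neighbour.
    diff = z[:, None, :] - z[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)
    return d2

def t_hard(y, z):
    # Rank-based coefficient with a hard nearest neighbour M(i):
    # sum_i (n * min(R_i, R_M(i)) - L_i^2) / sum_i L_i * (n - L_i)
    n = len(y)
    r, l = _ranks(y)
    nn = _sq_dists(z).argmin(axis=1)
    num = (n * np.minimum(r, r[nn]) - l ** 2).sum()
    den = (l * (n - l)).sum()
    return num / den

def t_soft(y, z, beta=1e3):
    # Assumed relaxation: the nearest-neighbour indicator becomes a
    # softmax over -beta * squared distance, which is smooth in z.
    n = len(y)
    r, l = _ranks(y)
    logits = -beta * _sq_dists(z)                 # diagonal is -inf
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    soft_min = (w * np.minimum(r[:, None], r[None, :])).sum(axis=1)
    num = (n * soft_min - l ** 2).sum()
    den = (l * (n - l)).sum()
    return num / den

rng = np.random.default_rng(0)
z = rng.normal(size=(400, 1))
y_dep = z[:, 0] ** 2          # y is a function of z
y_ind = rng.normal(size=400)  # y is independent of z
print(t_hard(y_dep, z), t_soft(y_dep, z))
print(t_hard(y_ind, z), t_soft(y_ind, z))
```

In this sketch both versions should come out close to 1 when `y` is a function of `z` and close to 0 when the two are independent. Because the softmax weights depend smoothly on `z`, the soft version can in principle be differentiated with respect to parameters that produce `z` (for example a learned feature map), whereas the `argmin` in the hard version cannot. As `beta` grows, the weights concentrate on the nearest neighbour and the soft value approaches the hard one.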

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The motivation is clear to me. FOCI is a tool for selecting important features from data based on their statistical relationships. However, it is not differentiable, which hinders its use in deep neural networks. To address this limitation, this submission proposes a differentiable, parametric approximation.
- This submission provides a clear definition of difFOCI. The toy examples offer some intuition into how difFOCI works.
- The three applications are well-chosen. They effectively demonstra…

Weaknesses

- While some real-world datasets are used in the experiments to demonstrate the effectiveness of difFOCI, it is unclear if it can be extended to large-scale datasets. Specifically, the datasets in Section 5.1 are small-scale, and the neural networks or learning algorithms used are relatively simple. The Waterbird task, for example, is simpler compared to multi-class tasks. Please discuss the scalability and generalization potential of difFOCI.
- The fairness study is interesting; however, the da…

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The motivation of the paper is well stated.
2. The paper is well-structured and well-written.
3. Providing results on both toy experiments and real-world datasets makes the paper more solid.

Weaknesses

1. The real-world datasets seem to be outdated; the latest one was released in 2019. It would be more convincing to present results on more recent and more complex datasets, such as those in the WILDS benchmark.
2. This paper only considers one model architecture, i.e., ResNet-50. With the increasing usage of Transformer-based models, it is also important to show the effectiveness on more complex models.
3. Simply showing the improved performance on worst group accuracy does not sufficient…

Reviewer 03 · Rating 6 · Confidence 3

Strengths

* S1. The proposed method is sound and has good potential for feature selection and feature debiasing.
* S2. The method has mainly good results in synthetic and some real datasets.

Weaknesses

* W1. The experiments on real datasets are not that strong.
* W1.1 For the spurious correlation experiments, the Waterbirds dataset is small and simple, and a successful method should be tested on other datasets besides it. How many seeds were used? Was the same protocol used to select hyperparameters for the baselines and the proposed method? The benchmark used in [A] can be used to evaluate the proposed method more rigorously. The results of the method will be more reliable if multiple…

Taxonomy

Topics: Face and Expression Recognition · Machine Learning and Data Classification