GRADIEND: Feature Learning within Neural Networks Exemplified through Biases

Jonathan Drechsel; Steffen Herbold

arXiv:2502.01406·cs.LG·March 10, 2026

Open Access·1 Repo·7 Models·5 Datasets·3 Reviews

TL;DR

This paper presents GRADIEND, a gradient-based feature learning method that identifies and modifies model biases related to social attributes, enabling debiasing while preserving model performance.

Contribution

Introduces a gradient-based encoder-decoder technique to learn and modify societal bias features within neural networks, facilitating debiasing without losing capabilities.

Findings

1. Effectively identifies bias-related weights in models
2. Can rewrite models to reduce biases
3. Maintains model performance after debiasing

Abstract

AI systems frequently exhibit and amplify social biases, leading to harmful consequences in critical areas. This study introduces a novel encoder-decoder approach that leverages model gradients to learn a feature neuron encoding societal bias information such as gender, race, and religion. We show that our method can not only identify which weights of a model need to be changed to modify a feature, but even demonstrate that this can be used to rewrite models to debias them while maintaining other capabilities. We demonstrate the effectiveness of our approach across various model architectures and highlight its potential for broader applications.
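
The abstract's core idea, a single feature neuron learned from model gradients and then used to rewrite weights, can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the `tanh` activation, the additive update rule, the `scale` parameter, ranking weights by decoder magnitude, and all function names are assumptions made for illustration only.

```python
import math

def encode(grad, w_enc, b_enc):
    """Compress a flattened gradient vector into one scalar feature value.

    The tanh squashing is an assumption; it keeps the feature in (-1, 1).
    """
    return math.tanh(sum(w * g for w, g in zip(w_enc, grad)) + b_enc)

def decode(h, w_dec, b_dec):
    """Expand the scalar feature back into one value per model weight."""
    return [h * w + b for w, b in zip(w_dec, b_dec)]

def rewrite_weights(weights, h, w_dec, b_dec, scale):
    """Shift model weights along the decoded direction.

    Hypothetical update rule: new_weight = weight + scale * decode(h).
    """
    update = decode(h, w_dec, b_dec)
    return [w + scale * u for w, u in zip(weights, update)]

def most_affected_weights(w_dec, k):
    """Return indices of the k weights with the largest decoder magnitude.

    Assumed heuristic for "which weights need to change".
    """
    ranked = sorted(range(len(w_dec)), key=lambda i: abs(w_dec[i]), reverse=True)
    return ranked[:k]

# Toy example with three model weights (all numbers are made up).
toy_gradient = [1.0, -1.0, 0.0]
encoder_weights, encoder_bias = [0.5, -0.5, 0.0], 0.0
decoder_weights, decoder_bias = [2.0, 0.0, -2.0], [0.0, 0.1, 0.0]

feature = encode(toy_gradient, encoder_weights, encoder_bias)
rewritten = rewrite_weights([1.0, 1.0, 1.0], feature, decoder_weights,
                            decoder_bias, scale=0.1)
print(feature, rewritten, most_affected_weights(decoder_weights, 2))
```

In this sketch the bottleneck forces all gradient information through one scalar, so the decoder weights alone determine which model parameters move when that scalar is changed.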

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 3

Strengths

1. The idea of encoding feature semantics directly from gradients is elegant, bridging interpretability and parameter-level debiasing through weight modification.
2. The proposed method is mathematically well formulated. The vivid illustration in Fig. 1 and the explanations in Section 3 are very clear.
3. I appreciate the experiments. The authors evaluated the method on seven major transformer models covering both encoder- and decoder-only LMs. They conducted a systematic analysis across three bias domain

Weaknesses

1. The encoder–decoder setup might be viewed as a re-parameterization of gradient differences, lacking deeper theoretical justification or an analysis of why it works.
2. There is no theoretical analysis of how task performance would be affected.
3. The binary gender assumption and the limited race/religion classes make the fairness conclusions narrow.
4. There is no comparison with more recent causal or reinforcement-based debiasing techniques.

Reviewer 02·Rating 4·Confidence 3

Strengths

The main strength is the new technique. Using gradients to learn a feature for bias is an interesting idea. The study also has a wide range of experiments. The authors evaluated on a vast set of models, which is good. The results show the method can change the models, which has high impact.

Weaknesses

- The paper's primary weakness is its presentation, which makes the methodology difficult to understand. The authors first explain the methodology with formal definitions and mathematics before providing a high-level overview. In its current format it is really hard for the reader to comprehend what you are actually doing. The paper would be significantly improved by first explaining the method conceptually and providing intuition about each step, and then diving into the formal definitions. An

Reviewer 03·Rating 8·Confidence 4

Strengths

1. This paper introduces a highly effective methodology for analyzing and potentially mitigating representational biases within language models. The proposed method leverages the gradients collected during the training process of the target LM. Specifically, it employs a single neuron bottleneck encoder-decoder network to classify updates into bias-related and non-bias-related feature classes. The experimental results demonstrate that this approach is robust and significantly useful in isolating

Weaknesses

Given that fine-tuning remains the dominant paradigm in the modern development and deployment of Language Models (LMs), the current methodology presented in this paper appears to overlook its direct applicability within this context. It would significantly strengthen the paper's relevance and impact to include a dedicated discussion on how the proposed method can be practically leveraged or adapted during the fine-tuning process. This discussion should address potential complexities, necessary

Code & Models

Repositories

aieng-lab/gradiend
PyTorch·Official

Models

Datasets

Taxonomy

Topics·Natural Language Processing Techniques