GRADIEND: Feature Learning within Neural Networks Exemplified through Biases
Jonathan Drechsel, Steffen Herbold

TL;DR
This paper presents GRADIEND, a gradient-based feature learning method that identifies and modifies model biases related to social attributes, enabling debiasing while preserving model performance.
Contribution
Introduces a gradient-based encoder-decoder technique to learn and modify societal bias features within neural networks, facilitating debiasing without losing capabilities.
Findings
- Effectively identifies bias-related weights in models
- Can rewrite models to reduce biases
- Maintains model performance after debiasing
Abstract
AI systems frequently exhibit and amplify social biases, leading to harmful consequences in critical areas. This study introduces a novel encoder-decoder approach that leverages model gradients to learn a feature neuron encoding societal bias information such as gender, race, and religion. We show that our method can not only identify which weights of a model need to be changed to modify a feature, but even demonstrate that this can be used to rewrite models to debias them while maintaining other capabilities. We demonstrate the effectiveness of our approach across various model architectures and highlight its potential for broader applications.
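The core idea in the abstract, an encoder-decoder over model gradients with a single feature neuron, can be illustrated with a toy sketch. This is not the authors' implementation: the synthetic "gradients", the hidden `direction` vector, the `tanh` activation, the mean-squared reconstruction loss, and the helper name `train_bottleneck` are all assumptions made for illustration. The sketch only shows that a one-neuron bottleneck can recover a sign-coded feature from gradient-like vectors, and that the decoder weights then indicate which parameters are tied to that feature.

```python
import numpy as np

def train_bottleneck(grads, steps=500, lr=0.1, seed=0):
    """Fit g_hat = tanh(g . w_enc + b_enc) * w_dec + b_dec by plain gradient descent."""
    rng = np.random.default_rng(seed)
    n, dim = grads.shape
    w_enc, b_enc = 0.1 * rng.normal(size=dim), 0.0
    w_dec, b_dec = 0.1 * rng.normal(size=dim), np.zeros(dim)
    for _ in range(steps):
        h = np.tanh(grads @ w_enc + b_enc)        # one scalar feature per sample
        recon = h[:, None] * w_dec + b_dec        # decoded gradient, shape (n, dim)
        d_recon = 2.0 * (recon - grads) / n       # derivative of the squared error, averaged over samples
        d_pre = (d_recon @ w_dec) * (1.0 - h**2)  # backpropagate through tanh
        w_dec -= lr * (h @ d_recon)
        b_dec -= lr * d_recon.sum(axis=0)
        w_enc -= lr * (grads.T @ d_pre)
        b_enc -= lr * d_pre.sum()
    return w_enc, b_enc, w_dec, b_dec

# Synthetic stand-ins for model gradients (assumption): the sign of each
# sample along a hidden direction in weight space encodes the feature class.
rng = np.random.default_rng(0)
dim = 32
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
labels = rng.choice([-1.0, 1.0], size=200)
grads = labels[:, None] * direction + 0.05 * rng.normal(size=(200, dim))

w_enc, b_enc, w_dec, b_dec = train_bottleneck(grads, seed=1)

# The feature neuron should separate the two classes (its sign is arbitrary).
feature = np.tanh(grads @ w_enc + b_enc)
agreement = np.mean(np.sign(feature) == labels)
agreement = max(agreement, 1.0 - agreement)

# The decoder weights should align with the hidden direction, so their largest
# entries point at the weights most associated with the feature.
cosine = abs(w_dec @ direction) / np.linalg.norm(w_dec)
top_weights = np.argsort(-np.abs(w_dec))[:5]

# Hypothetical weight edit: shift toy parameters along the decoded direction.
alpha = 0.5                                       # illustrative step size
theta = rng.normal(size=dim)                      # toy "model weights"
theta_edited = theta + alpha * w_dec

print("class agreement:", agreement)
print("cosine(decoder, hidden direction):", cosine)
print("indices of largest decoder weights:", top_weights)
```

In a real setting the rows of `grads` would come from a language model's gradients on feature-related inputs rather than from synthetic vectors, and the step size and sign of the weight edit would need to be chosen by evaluating the edited model.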
Peer Reviews
Decision: ICLR 2026 Poster
1. The idea of encoding feature semantics directly from gradients is elegant, bridging interpretability and parameter-level debiasing through weight modification.
2. The proposed method is mathematically well-formulated. The vivid illustration in Fig. 1 and the explanations in Section 3 are very clear.
3. I appreciate the experiments. The authors evaluated the method on seven major transformer models covering both encoder- and decoder-only LMs. They conducted a systematic analysis across three bias domain
1. The encoder–decoder setup might be viewed as a re-parameterization of gradient differences, lacking deeper theoretical justification or analysis of why it works.
2. There is no theoretical analysis of how task performance would be affected.
3. The binary gender assumption and the limited race/religion classes make the fairness conclusions narrow.
4. There is no comparison with more recent causal or reinforcement-based debiasing techniques.
The main strength is the new technique. Using gradients to learn a feature for bias is an interesting idea. The study also includes a wide range of experiments. The authors evaluated the method on a large set of models, which is good. The results show that the method can change the models, which gives it high impact.
- The paper's primary weakness is its presentation, which makes the methodology difficult to understand. The authors first explain the methodology with formal definitions and mathematics before providing a high-level overview. In its current format it is really hard for the reader to comprehend what you are actually doing. The paper would be significantly improved by first explaining the method conceptually and providing intuition about each step, and then diving into the formal definitions. An
1. This paper introduces a highly effective methodology for analyzing and potentially mitigating representational biases within language models. The proposed method leverages the gradients collected during the training process of the target LM. Specifically, it employs a single neuron bottleneck encoder-decoder network to classify updates into bias-related and non-bias-related feature classes. The experimental results demonstrate that this approach is robust and significantly useful in isolating
Given that fine-tuning remains the dominant paradigm in the modern development and deployment of Language Models (LMs), the current methodology presented in this paper appears to overlook its direct applicability within this context. It would significantly strengthen the paper's relevance and impact to include a dedicated discussion on how the proposed method can be practically leveraged or adapted during the fine-tuning process. This discussion should address potential complexities, necessary
Code & Models
- 🤗 aieng-lab/bert-base-cased-gradiend-gender-debiased · model
- 🤗 aieng-lab/bert-large-cased-gradiend-gender-debiased · model · 6 dl
- 🤗 aieng-lab/distilbert-base-cased-gradiend-gender-debiased · model · 6 dl
- 🤗 aieng-lab/roberta-large-gradiend-gender-debiased · model · 4 dl
- 🤗 aieng-lab/gpt2-gradiend-gender-debiased · model · 3 dl
- 🤗 aieng-lab/Llama-3.2-3B-gradiend-gender-debiased · model · 5 dl
- 🤗 aieng-lab/Llama-3.2-3B-Instruct-gradiend-gender-debiased · model · 5 dl
Taxonomy
Topics: Natural Language Processing Techniques
