ReSeTOX: Re-learning attention weights for toxicity mitigation in machine translation
Javier García Gilabert, Carlos Escolano, Marta R. Costa-Jussà

TL;DR
ReSeTOX is an inference-time method for neural machine translation that adjusts attention weights when a translation adds toxic words not present in the input, then redoes the beam search, reducing added toxicity without retraining the model.
Contribution
It introduces a novel inference-time technique to mitigate toxicity in NMT by re-learning attention weights, avoiding the need for re-training.
Findings
57% reduction in added toxicity
Maintains an average translation quality of 99.5%
Evaluated across 164 languages
Abstract
Our proposed method, ReSeTOX (REdo SEarch if TOXic), addresses the issue of Neural Machine Translation (NMT) generating translation outputs that contain toxic words not present in the input. The objective is to mitigate the introduction of toxic language without the need for re-training. In the case of identified added toxicity during the inference process, ReSeTOX dynamically adjusts the key-value self-attention weights and re-evaluates the beam search hypotheses. Experimental results demonstrate that ReSeTOX achieves a remarkable 57% reduction in added toxicity while maintaining an average translation quality of 99.5% across 164 languages.
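The control flow described in the abstract (detect added toxicity, adjust, redo the search) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: toxicity detection is approximated here with plain word lists, and `search` and `adjust` are hypothetical callables standing in for the model's beam search and for the attention-weight update, whose details the abstract does not give. The retry cap `max_retries` is also an assumption.

```python
from typing import Callable, Set


def toxic_tokens(text: str, wordlist: Set[str]) -> Set[str]:
    """Return the whitespace-separated tokens of `text` found in `wordlist`."""
    return {tok for tok in text.lower().split() if tok in wordlist}


def has_added_toxicity(
    source: str,
    hypothesis: str,
    src_wordlist: Set[str],
    tgt_wordlist: Set[str],
) -> bool:
    """True when the hypothesis contains toxic words but the source does not."""
    return bool(toxic_tokens(hypothesis, tgt_wordlist)) and not toxic_tokens(
        source, src_wordlist
    )


def redo_search_if_toxic(
    source: str,
    search: Callable[[str], str],    # hypothetical: runs beam search, returns best hypothesis
    adjust: Callable[[str], None],   # hypothetical: updates attention weights given a toxic hypothesis
    src_wordlist: Set[str],
    tgt_wordlist: Set[str],
    max_retries: int = 3,            # assumed cap; not taken from the source
) -> str:
    """Translate, and redo the search after an adjustment while toxicity is added."""
    hypothesis = search(source)
    for _ in range(max_retries):
        if not has_added_toxicity(source, hypothesis, src_wordlist, tgt_wordlist):
            break
        adjust(hypothesis)           # placeholder for the attention-weight update
        hypothesis = search(source)  # re-evaluate the hypotheses after the update
    return hypothesis
```

Because the model is only touched through `adjust` at inference time, the sketch reflects the abstract's point that no re-training is involved; everything inside `adjust` is left abstract on purpose.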
Taxonomy
Topics: Natural Language Processing Techniques · Adversarial Robustness in Machine Learning
