Adversarial Fine-Tuning of Language Models: An Iterative Optimisation Approach for the Generation and Detection of Problematic Content

Charles O'Neill; Jack Miller; Ioana Ciuca; Yuan-Sen Ting; Thang Bui

arXiv:2308.13768 · cs.CL · August 29, 2023 · 1 citation

Open Access

TL;DR

This paper introduces an adversarial fine-tuning method for large language models that iteratively improves the detection of harmful content, outperforming GPT-4 in identifying problematic prompts.

Contribution

The paper presents a novel dual-stage adversarial fine-tuning approach that enhances the ability of language models to generate and detect problematic content.

Findings

01. Significant increase in the classification accuracy of the judge model.

02. A rudimentary model outperforms GPT-4 after a few fine-tuning rounds.

03. Improved performance in toxic comment detection.

Abstract

In this paper, we tackle the emerging challenge of unintended harmful content generation in Large Language Models (LLMs) with a novel dual-stage optimisation technique using adversarial fine-tuning. Our two-pronged approach employs an adversarial model, fine-tuned to generate potentially harmful prompts, and a judge model, iteratively optimised to discern these prompts. In this adversarial cycle, the two models seek to outperform each other in the prompting phase, generating a dataset of rich examples which are then used for fine-tuning. This iterative application of prompting and fine-tuning allows continuous refinement and improved performance. The performance of our approach is evaluated through classification accuracy on a dataset consisting of problematic prompts not detected by GPT-4, as well as a selection of contentious but unproblematic prompts. We show considerable increase in…
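The alternation of prompting and fine-tuning described in the abstract can be pictured with a toy loop. This is only a minimal sketch and not the authors' implementation: `ToyAdversary`, `ToyJudge`, `run_rounds`, and the placeholder prompts are hypothetical names, no language model is involved, and "fine-tuning" is reduced to memorisation. It also assumes that the judge is updated on the prompts it failed to flag and that the adversary moves away from prompts that were flagged; the abstract does not specify which examples update which model.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ToyJudge:
    """Hypothetical stand-in for the judge model: flags prompts it has memorised."""
    known: Set[str] = field(default_factory=set)

    def flags(self, prompt: str) -> bool:
        return prompt in self.known

    def fine_tune(self, missed: List[str]) -> None:
        # Assumption: the judge is updated on adversarial prompts it failed to flag.
        self.known.update(missed)


@dataclass
class ToyAdversary:
    """Hypothetical stand-in for the adversary: draws prompts from a fixed pool."""
    pool: List[str]
    caught: Set[str] = field(default_factory=set)

    def generate(self, n: int) -> List[str]:
        # Prefer prompts that have not been flagged yet; fall back to the full pool.
        fresh = [p for p in self.pool if p not in self.caught]
        return (fresh or self.pool)[:n]

    def fine_tune(self, flagged: List[str]) -> None:
        # Assumption: the adversary steers away from prompts the judge flagged.
        self.caught.update(flagged)


def run_rounds(adversary: ToyAdversary, judge: ToyJudge,
               n_rounds: int, batch_size: int) -> List[float]:
    """Alternate a prompting phase and a fine-tuning phase; return per-round miss rates."""
    miss_rates = []
    for _ in range(n_rounds):
        # Prompting phase: the adversary proposes prompts and the judge classifies them.
        prompts = adversary.generate(batch_size)
        flagged = [p for p in prompts if judge.flags(p)]
        missed = [p for p in prompts if not judge.flags(p)]
        miss_rates.append(len(missed) / len(prompts))
        # Fine-tuning phase: both toy models are updated on the round's outcomes.
        judge.fine_tune(missed)
        adversary.fine_tune(flagged)
    return miss_rates
```

Calling `run_rounds(ToyAdversary(pool=["p0", "p1", "p2", "p3"]), ToyJudge(), n_rounds=5, batch_size=2)` returns `[1.0, 0.0, 1.0, 0.0, 0.0]`: the judge's miss rate drops once it has been updated on a batch, rises again when the adversary switches to unseen prompts, and settles at zero when the pool is exhausted.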

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Layer Normalization · Dropout · Byte Pair Encoding · Adam · Position-Wise Feed-Forward Layer