Mitigating Paraphrase Attacks on Machine-Text Detectors via Paraphrase Inversion
Rafael Rivera Soto, Barry Chen, Nicholas Andrews

TL;DR
This paper introduces a novel paraphrase inversion method to recover original texts from paraphrased versions, significantly improving machine-text detector robustness against paraphrasing attacks across multiple domains.
Contribution
The paper proposes a translation-based approach for paraphrase inversion, demonstrating its effectiveness and generalization to unseen paraphrasing models, enhancing detector performance.
Findings
Inversion models improve detector AUROC by 22% on average.
Models generalize well to unseen paraphrasing techniques.
Effective defense against paraphrasing attacks across domains.
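To make the AUROC metric in the findings concrete, here is a minimal sketch of how it could be computed for a detector before and after inversion. The function name `auroc` and all score values are illustrative assumptions, not taken from the paper; the scores are toy numbers chosen only to show the mechanics.

```python
from typing import Sequence


def auroc(scores_pos: Sequence[float], scores_neg: Sequence[float]) -> float:
    """AUROC as the probability that a random positive example scores
    higher than a random negative one (ties count as half a win)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Toy detector scores (higher = "more likely machine-written").
# These values are made up for illustration only.
human_scores = [0.10, 0.20, 0.30]
paraphrased_machine_scores = [0.15, 0.25, 0.60]  # attack lowers the scores
inverted_machine_scores = [0.40, 0.70, 0.80]     # after a hypothetical inversion step

auroc_paraphrased = auroc(paraphrased_machine_scores, human_scores)
auroc_inverted = auroc(inverted_machine_scores, human_scores)
print(f"AUROC on paraphrased text: {auroc_paraphrased:.3f}")
print(f"AUROC after inversion:     {auroc_inverted:.3f}")
```

The pairwise formulation is quadratic in the number of examples, which is fine for a sketch; a rank-based computation would be the usual choice for large evaluation sets.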
Abstract
High-quality paraphrases are easy to produce using instruction-tuned language models or specialized paraphrasing models. Although this capability has a variety of benign applications, paraphrasing attacks (paraphrases applied to machine-generated texts) are known to significantly degrade the performance of machine-text detectors. This motivates us to consider the novel problem of paraphrase inversion, where, given paraphrased text, the objective is to recover an approximation of the original text. The closer the approximation is to the original text, the better machine-text detectors will perform. We propose an approach which frames the problem as translation from paraphrased text back to the original text, which requires examples of texts and corresponding paraphrases to train the inversion model. Fortunately, such training data can easily be generated,…
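The translation framing in the abstract implies training pairs whose source is the paraphrase and whose target is the original text. The sketch below shows one way such pairs might be assembled. It assumes pairs are produced by running a paraphraser over a corpus of original texts; `toy_paraphrase`, `build_inversion_pairs`, and the synonym table are hypothetical stand-ins, not the authors' pipeline or models.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a real paraphrasing model: a tiny synonym swap.
_SYNONYMS = {"quick": "fast", "big": "large", "said": "stated"}


def toy_paraphrase(text: str) -> str:
    """Replace known words with a synonym; leave everything else untouched."""
    return " ".join(_SYNONYMS.get(word, word) for word in text.split())


def build_inversion_pairs(
    originals: List[str],
    paraphrase_fn: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Build (paraphrase -> original) pairs for a sequence-to-sequence model.

    The direction is reversed relative to paraphrase training: the
    paraphrase is the input and the original is the target.
    """
    pairs = []
    for original in originals:
        paraphrase = paraphrase_fn(original)
        # Skip texts the paraphraser left unchanged; they carry no signal.
        if paraphrase != original:
            pairs.append({"source": paraphrase, "target": original})
    return pairs


pairs = build_inversion_pairs(
    ["the quick fox", "she said hello", "no change here"],
    toy_paraphrase,
)
for pair in pairs:
    print(pair["source"], "->", pair["target"])
```

In practice the paraphraser would be a language model and the resulting pairs would feed a standard sequence-to-sequence training loop; both of those pieces are outside the scope of this sketch.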
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
