Few-Shot Detection of Machine-Generated Text using Style Representations
Rafael Rivera Soto, Kailin Koch, Aleem Khan, Barry Chen, Marcus Bishop, and Nicholas Andrews

TL;DR
This paper introduces a method for detecting machine-generated text using writing-style representations learned from human-authored text, enabling identification without relying on model-specific training data.
Contribution
It proposes a style-based detection approach that generalizes across models and can identify the specific model used, overcoming limitations of existing supervised detectors.
Findings
Effective at distinguishing human from machine text.
Able to identify the specific language model used.
Robust to data shifts and new model releases.
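The few-shot idea summarized above can be illustrated with a minimal sketch. It assumes a style encoder is available and that a handful of example documents from each model of interest are on hand. Everything named here is hypothetical: `embed` is a toy featurizer (function-word and punctuation frequencies) standing in for a learned style representation, and the nearest-centroid scoring is one plausible way to use such representations, not necessarily the authors' procedure.

```python
import math
from collections import Counter

# Hypothetical stand-in for a learned style encoder. The paper's
# representations are estimated from human-authored text; this toy
# featurizer is an assumption made only to keep the sketch runnable.
STYLE_FEATURES = ["the", "of", "and", "i", "as", "a", ",", ".", "!", ";"]


def embed(text):
    """Map a document to relative frequencies of a few style markers."""
    spaced = text.lower()
    for mark in ",.!;":
        spaced = spaced.replace(mark, f" {mark} ")
    tokens = spaced.split()
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[f] / total for f in STYLE_FEATURES]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def machine_score(text, few_shot_examples):
    """Similarity between a document and the centroid of a few examples
    known to come from one language model. Higher means closer in style."""
    prototype = centroid([embed(t) for t in few_shot_examples])
    return cosine(embed(text), prototype)


def attribute(text, examples_by_model):
    """Return the model name whose few-shot centroid is most similar."""
    return max(
        examples_by_model,
        key=lambda name: machine_score(text, examples_by_model[name]),
    )
```

In use, `machine_score` would be thresholded to separate human from machine text, and `attribute` would pick among candidate models. Swapping the toy `embed` for a real style encoder leaves the rest of the sketch unchanged.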
Abstract
The advent of instruction-tuned language models that convincingly mimic human writing poses a significant risk of abuse. However, such abuse may be counteracted with the ability to detect whether a piece of text was composed by a language model rather than a human author. Some previous approaches to this problem have relied on supervised methods by training on corpora of confirmed human- and machine-written documents. Unfortunately, model under-specification poses an unavoidable challenge for neural network-based detectors, making them brittle in the face of data shifts, such as the release of newer language models producing still more fluent text than the models used to train the detectors. Other approaches require access to the models that may have generated a document in question, which is often impractical. In light of these challenges, we pursue a fundamentally different approach…
Peer Reviews
Decision·ICLR 2024 poster
It presents a valuable contribution to an important topic. This is both with respect to the dataset (which hopefully will be released), as well as to the methodology. The setting of few-shot is also more realistic than previous proposals which assumed access to a very large corpus of machine-generated data.
The exact setting is not clear, and the presentation of the results under a single value (AUC) makes it hard to assess the impact of this work. In particular, it would be good to see some more examples to get a better intuition. It would seem like the evaluation examples are easily detectable because the LLM mimics the provided persona so well as to insist on it in the generated text (`...as a <persona>, I...`). This might be the reason for the comparably high scores (as opposed to previous work
Pros:
- The proposed approach leverages representations of writing style estimated from human-authored text, which can effectively distinguish between human and machine-generated text.
- The approach does not rely on samples from language models of concern at training time, which makes it more robust to data shifts and more practical to implement.
- The experiments conducted by the authors demonstrate that their approach outperforms previous approaches to detecting machine-generated text.
Cons:
- The approach assumes that documents generated from amply available and cheap (AAC) models are available at training time, which may not always be the case in practice.
- The work has strong limitations in training the style representation learning network. Basically, writing style can also change over time. The existing writing style training dataset may not be powerful enough to cover the whole. The work adopts (and highly depends on) a supervised approach for this, which has
In general, the paper is well-written. The paper tackles a relevant problem: Detecting LLM-written text. Previous work, such as the classifier released by OpenAI, had much worse performance. Also, the approach is new: Instead of supervised training, the method effectively uses pre-trained style representations for the few-shot setting. The results of the experiments look convincing (maybe overly convincing, considering that, for instance, in Table 1, AUC' is almost perfect even for a false posit
I tend to reject the paper. The reasons are the following: 1. In the end, the paper boils down to applying a few methods that learn stylistic and semantic representations to classify a document as human- or LLM-written. There is no new method, and the contribution lies in applying existing methods and creating a training and evaluation corpus for AAC and LWD LLMs (what the authors do not mention as a contribution, and these datasets will not be made available according to the paper). 2. The expe
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Label Smoothing · Absolute Position Encodings · Layer Normalization · Softmax · Residual Connection · Linear Layer · Byte Pair Encoding · Dropout
