Online Detection of LLM-Generated Texts via Sequential Hypothesis Testing by Betting
Can Chen, Jun-Kun Wang

TL;DR
This paper introduces an online detection algorithm for identifying LLM-generated texts using sequential hypothesis testing, enabling rapid and statistically guaranteed decisions in streaming content scenarios.
Contribution
The paper proposes a novel online detection method based on sequential hypothesis testing by betting, with statistical guarantees and applicability to streaming data.
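As a rough illustration of what testing by betting looks like, here is a minimal sketch. It is not the authors' algorithm. It assumes each text has already been mapped to a score in [0, 1] by some detector, that the null hypothesis is "source scores and known-human scores have the same mean", and that the bet is set by a simple clipped running mean of past payoffs. The function name `betting_test` and its parameters are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def betting_test(
    source_scores: Iterable[float],
    human_scores: Iterable[float],
    alpha: float = 0.05,
    max_bet: float = 0.5,
) -> Tuple[Optional[int], float]:
    """Generic testing-by-betting sketch (illustrative, not the paper's method).

    Scores are assumed to lie in [0, 1], so each payoff lies in [-1, 1].
    Returns (stopping_time, wealth); stopping_time is None if the null
    hypothesis (source behaves like a human) was never rejected.
    """
    wealth = 1.0       # initial wealth W_0 = 1
    bet = 0.0          # betting fraction, chosen before seeing the next payoff
    payoff_sum = 0.0   # running sum of past payoffs

    for t, (x, y) in enumerate(zip(source_scores, human_scores), start=1):
        payoff = x - y                    # zero mean under the null
        wealth *= 1.0 + bet * payoff      # stays positive because |bet| <= 0.5
        if wealth >= 1.0 / alpha:         # reject the null: flag source as LLM
            return t, wealth
        # Update the bet using only past payoffs (assumed simple rule).
        payoff_sum += payoff
        bet = max(-max_bet, min(max_bet, payoff_sum / t))

    return None, wealth
```

Calling `betting_test(scores_from_source, scores_from_humans, alpha=0.05)` on two aligned score streams returns the round at which the source was flagged, or `None` if the wealth never reached `1 / alpha`. A sign-following bet like this one lets the wealth grow whichever direction the score gap goes, which is one simple way to handle a two-sided alternative.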
Findings
Detects LLM-generated texts effectively in real time
Keeps the false positive rate under control
Identifies LLM sources quickly
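For context on the false positive claim: in the general testing-by-betting framework, this kind of control usually comes from Ville's inequality. The statement below is the standard result, written under the assumption that the bettor's wealth $W_t$ is a nonnegative martingale under the null hypothesis with $W_0 = 1$; the summary above does not spell out the exact form used in the paper.

```latex
\[
  \Pr_{H_0}\!\left( \exists\, t \ge 1 : W_t \ge \frac{1}{\alpha} \right) \le \alpha ,
  \qquad \alpha \in (0, 1).
\]
```

So a rule that declares "LLM" the first time the wealth reaches $1/\alpha$ wrongly flags a human source with probability at most $\alpha$, no matter when the test stops.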
Abstract
Developing algorithms to differentiate between machine-generated texts and human-written texts has garnered substantial attention in recent years. Existing methods in this direction typically concern an offline setting where a dataset containing a mix of real and machine-generated texts is given upfront, and the task is to determine whether each sample in the dataset is from a large language model (LLM) or a human. However, in many practical scenarios, sources such as news websites, social media accounts, and online forums publish content in a streaming fashion. Therefore, in this online scenario, how to quickly and accurately determine whether the source is an LLM with strong statistical guarantees is crucial for these media or platforms to function effectively and prevent the spread of misinformation and other potential misuse of LLMs. To tackle the problem of online detection, we…
Taxonomy
Topics: Natural Language Processing Techniques · Text and Document Classification Technologies · Handwritten Text Recognition Techniques
Methods: Softmax · Attention Is All You Need
