Beyond Raw Detection Scores: Markov-Informed Calibration for Boosting Machine-Generated Text Detection
Chenwang Wu, Yiu-ming Cheung, Shuhai Zhang, Bo Han, Defu Lian

TL;DR
This paper introduces a Markov-informed calibration method for machine-generated text detection that improves accuracy by modeling token-level score biases, with minimal computational cost.
Contribution
It proposes a novel calibration strategy using Markov random fields to address token score biases in metric-based detection methods, enhancing detection robustness.
Findings
Significant performance improvements over baseline detectors.
Effective in cross-LLM and paraphrasing attack scenarios.
Minimal additional computational overhead.
Abstract
While machine-generated texts (MGTs) offer great convenience, they also pose risks such as disinformation and phishing, highlighting the need for reliable detection. Metric-based methods, which extract statistically distinguishable features of MGTs, are often more practical than complex model-based methods that are prone to overfitting. Given their diverse designs, we first place representative metric-based methods within a unified framework, enabling a clear assessment of their advantages and limitations. Our analysis identifies a core challenge across these methods: the token-level detection score is easily biased by the inherent randomness of the MGT generation process. To address this, we theoretically and empirically reveal two relationships of context detection scores that may aid calibration: Neighbor Similarity and Initial Instability. We then propose a Markov-informed score…
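To make the two ingredients in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the token-level detection scores (for example, per-token log-likelihood or log-rank from a scoring LLM) have already been computed and are passed in as a plain list. The function names `mean_score` and `gap_by_distance` are hypothetical. The first function is the plain averaging that metric-based detectors commonly apply to token scores; the second is one simple way to probe Neighbor Similarity, by measuring how much the scores of two tokens differ as a function of their positional distance.

```python
def mean_score(token_scores):
    """Baseline aggregation: average the token-level scores of one text."""
    return sum(token_scores) / len(token_scores)


def gap_by_distance(token_scores, max_distance):
    """Mean absolute score difference between token pairs, keyed by distance.

    gaps[d] is the average of |s_i - s_(i+d)| over all valid positions i.
    This is an illustrative diagnostic (an assumption of this sketch),
    not the calibration procedure proposed in the paper.
    """
    gaps = {}
    n = len(token_scores)
    for d in range(1, max_distance + 1):
        diffs = [abs(token_scores[i] - token_scores[i + d]) for i in range(n - d)]
        if diffs:  # skip distances longer than the text
            gaps[d] = sum(diffs) / len(diffs)
    return gaps


if __name__ == "__main__":
    # Hypothetical per-token scores for a single short text.
    scores = [0.0, 1.0, 2.0, 3.0, 4.0]
    print(mean_score(scores))
    print(gap_by_distance(scores, max_distance=3))
```

If neighboring tokens do tend to have similar scores, `gaps[d]` would be expected to be smaller for small `d` than for large `d` when averaged over many texts; that is the kind of relationship a calibration step could exploit.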
Peer Reviews
Decision: ICLR 2026 Poster
1. This work identified and analyzed randomness issues during token generation which might influence the detection performance of LLM-generated text detectors, which makes sense and the perspective appears new. 2. A Markov-informed score strategy was formulated and served as a plug-and-play for existing fake text detectors. 3. Experiments on certain datasets were done to validate some claims proposed in this work.
1. Unclear presentation and lack of clarity. The definition and introduction of neighbor similarity is not clear in either the writing or the illustrations. Particularly in Fig. 2 and Fig. 3, much important information is missing, e.g., how the detection scores are computed, what kind of distance is used, and why there are two variables on the horizontal axis (position and log-rank score; which variable do the numerical values 0.0 to 0.8 refer to?). 2. Though some theorems were provide
The basic idea of the approach is sensible and works. The analysis (Figure 2) of how the differences between the log-likelihood scores of pairs of tokens drawn from the text are correlated with the distance between the tokens is interesting. Table 2 is very compelling in showing the uplift in performance when 'averaging' scores from a detector is replaced with the authors' Markov Random Field approach.
1. I do not rate the discussion of the current state of the art very highly, either in its breadth (there is much state of the art work which is missed, including many detectors which do look at local patterns in things like log-likelihood rather than just raw averages) or its accuracy. For example, there is a claim in the abstract and on page 2 that the weaknesses of token based discriminators between human and machine text stem from the randomness present in sampling algorithms for machine tex
- The paper clearly identifies the root cause of score bias in metric-based detectors, namely, randomness introduced by the LLM generation process. - The proposed method is conceptually simple, computationally efficient, and effective. - The experiments are broad and thorough, including challenging setups such as **DetectRL**, and report diagnostic metrics such as **TPR@FPR-1%** in addition to AUROC.
- Although the proposed method is intuitive and effective, there is concern about the practical significance of its improvement for statistical detection methods. As shown in Table 2, all metric-based detectors still perform poorly on the challenging DetectRL benchmark, far below the level required for real-world applications. - The evaluated detectors are somewhat outdated. Incorporating more advanced methods such as Binoculars [1] and RepreGuard [2] could better showcase the effectiveness of t
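The reviews refer to AUROC and TPR@FPR-1% as evaluation metrics. As a reading aid, here is a small self-contained sketch of how these two quantities are commonly computed from detector scores. It assumes that a higher score means "more likely machine-generated" and that a text is flagged when its score is strictly above a threshold; both conventions, and the function names `auroc` and `tpr_at_fpr`, are assumptions of this sketch rather than details taken from the paper.

```python
def auroc(human_scores, machine_scores):
    """Probability that a random machine text outscores a random human text.

    Ties count as half a win. Quadratic in the number of samples, which is
    fine for illustration but slow for large evaluation sets.
    """
    wins = 0.0
    for m in machine_scores:
        for h in human_scores:
            if m > h:
                wins += 1.0
            elif m == h:
                wins += 0.5
    return wins / (len(machine_scores) * len(human_scores))


def tpr_at_fpr(human_scores, machine_scores, max_fpr=0.01):
    """Best true-positive rate among thresholds whose false-positive rate
    does not exceed max_fpr, using the rule 'flag if score > threshold'."""
    best = 0.0
    for t in sorted(set(human_scores) | set(machine_scores)):
        fpr = sum(h > t for h in human_scores) / len(human_scores)
        if fpr <= max_fpr:
            tpr = sum(m > t for m in machine_scores) / len(machine_scores)
            best = max(best, tpr)
    return best


if __name__ == "__main__":
    # Hypothetical detector scores, for illustration only.
    human = [0.1, 0.2, 0.3, 0.4]
    machine = [0.35, 0.5, 0.6, 0.7]
    print(auroc(human, machine))
    print(tpr_at_fpr(human, machine, max_fpr=0.01))
```

TPR at a low FPR is the stricter of the two: a detector can reach a high AUROC while still flagging few machine texts once false alarms on human text are capped at 1%.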
Taxonomy
Topics: Spam and Phishing Detection · Authorship Attribution and Profiling · Misinformation and Its Impacts
