TL;DR
This paper studies how to rewrite AI-generated text so that it reads as human-authored. It compares encoder-decoder models (BART-base, BART-large) with a decoder-only model (Mistral-7B-Instruct fine-tuned with QLoRA) and argues that current style transfer evaluation overlooks shift accuracy.
Contribution
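It constructs a parallel corpus of 25,140 paired AI-input and human-reference text chunks, identifies 11 measurable stylistic markers that separate the two registers, and fine-tunes and evaluates three models. BART-large gives the highest reference similarity, and the comparison exposes a weakness in current evaluation: a higher marker shift score can reflect overshoot rather than accuracy.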
Findings
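BART-large achieves the highest reference similarity: BERTScore F1 of 0.924, ROUGE-L of 0.566, and chrF++ of 55.92, with 17x fewer parameters than Mistral-7B.
Mistral-7B's higher marker shift score reflects overshoot rather than accuracy.
Shift accuracy is a meaningful blind spot in current style transfer evaluation.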
Abstract
AI-generated text has become common in academic and professional writing, prompting research into detection methods. Less studied is the reverse: systematically rewriting AI-generated prose to read as genuinely human-authored. We build a parallel corpus of 25,140 paired AI-input and human-reference text chunks, identify 11 measurable stylistic markers separating the two registers, and fine-tune three models: BART-base, BART-large, and Mistral-7B-Instruct with QLoRA. BART-large achieves the highest reference similarity -- BERTScore F1 of 0.924, ROUGE-L of 0.566, and chrF++ of 55.92 -- with 17x fewer parameters than Mistral-7B. We show that Mistral-7B's higher marker shift score reflects overshoot rather than accuracy, and argue that shift accuracy is a meaningful blind spot in current style transfer evaluation.
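The abstract does not define the marker shift score or shift accuracy, so the following is only a hypothetical sketch of why the two can disagree. It assumes a single numeric marker (for example, mean sentence length) measured on the AI input, the model output, and the human reference. The function names, formulas, and numbers are illustrative assumptions, not the paper's definitions or results.

```python
def shift_score(src: float, out: float, ref: float) -> float:
    """Hypothetical shift score: how far the output moved from the source,
    as a fraction of the source-to-reference gap. Not bounded above, so
    moving past the reference scores higher than landing on it."""
    gap = ref - src
    if gap == 0:
        return 0.0
    return (out - src) / gap


def shift_accuracy(src: float, out: float, ref: float) -> float:
    """Hypothetical shift accuracy: 1.0 when the output lands on the
    reference value, falling off linearly with distance from it in either
    direction, floored at 0.0."""
    gap = ref - src
    if gap == 0:
        return 1.0 if out == src else 0.0
    return max(0.0, 1.0 - abs(out - ref) / abs(gap))


# Illustrative marker values only (assumed, not taken from the paper).
SRC, REF = 10.0, 20.0
for label, out in [("undershoot", 15.0), ("on target", 20.0), ("overshoot", 25.0)]:
    print(label, shift_score(SRC, out, REF), shift_accuracy(SRC, out, REF))
```

Under these assumed definitions, the overshooting output gets the highest shift score of the three, while its shift accuracy is no better than that of an equally sized undershoot. A metric that rewards only the amount of movement cannot tell these cases apart from a genuine match to the human reference.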
