AI "News" Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian
Giovanni Puccetti, Anna Rogers, Chiara Alzetta, Felice Dell'Orletta, Andrea Esuli

TL;DR
This study demonstrates that fine-tuned LLMs can generate convincing Italian news articles that are difficult to detect, highlighting the urgent need for more effective detection methods for synthetic news content.
Contribution
It shows that small-scale fine-tuning of Llama on Italian news data produces highly realistic synthetic news, and evaluates the limitations of current detection methods.
Findings
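Fine-tuned Llama can produce Italian news-like articles that native speakers struggle to identify as synthetic.
Existing detection methods outperform human raters but are impractical in real-world scenarios, because they require either token likelihood information or a large dataset of CFM texts.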
Creating a proxy CFM with minimal data is feasible but requires knowledge of the base model.
Abstract
Large Language Models (LLMs) are increasingly used as "content farm" models (CFMs), to generate synthetic text that could pass for real news articles. This is already happening even for languages that do not have high-quality monolingual LLMs. We show that fine-tuning Llama (v1), mostly trained on English, on as little as 40K Italian news articles, is sufficient for producing news-like texts that native speakers of Italian struggle to identify as synthetic. We investigate three LLMs and three methods of detecting synthetic texts (log-likelihood, DetectGPT, and supervised classification), finding that they all perform better than human raters, but they are all impractical in the real world (requiring either access to token likelihood information or a large dataset of CFM texts). We also explore the possibility of creating a proxy CFM: an LLM fine-tuned on a similar dataset to one used…
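To make the likelihood-based detectors named in the abstract more concrete, here is a minimal sketch, not the authors' implementation. It assumes per-token log-probabilities have already been obtained from some scoring model, which is exactly the access the abstract notes is often unavailable in practice. The function names, the threshold value, and the toy numbers are hypothetical. The second function follows the normalized perturbation discrepancy commonly associated with DetectGPT: the log-likelihood of the original text is compared against the log-likelihoods of perturbed rewrites of it.

```python
from statistics import mean, stdev


def mean_log_likelihood(token_logprobs):
    """Average per-token log-probability of a text under a scoring model.

    `token_logprobs` is assumed to come from a model that exposes
    token likelihoods; obtaining it is outside this sketch.
    """
    return sum(token_logprobs) / len(token_logprobs)


def flag_by_likelihood(token_logprobs, threshold):
    """Flag a text as likely synthetic if its mean log-likelihood is high.

    The threshold is a hypothetical value that would have to be tuned
    on held-out human and synthetic texts.
    """
    return mean_log_likelihood(token_logprobs) > threshold


def detectgpt_score(original_ll, perturbed_lls):
    """DetectGPT-style normalized perturbation discrepancy.

    original_ll: log-likelihood of the candidate text.
    perturbed_lls: log-likelihoods of perturbed rewrites of that text.
    A larger score suggests the text sits near a local likelihood peak,
    which is treated as evidence of machine generation.
    """
    mu = mean(perturbed_lls)
    sigma = stdev(perturbed_lls)
    return (original_ll - mu) / sigma if sigma > 0 else 0.0


if __name__ == "__main__":
    # Toy numbers for illustration only; not results from the paper.
    print(flag_by_likelihood([-0.5, -0.5], threshold=-1.0))
    print(detectgpt_score(-1.0, [-2.0, -3.0, -4.0]))
```

Both functions depend on likelihoods from a model close to the one that generated the text, which is why the abstract describes these detectors as hard to deploy against an unknown CFM.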
Taxonomy
Topics: Computational and Text Analysis Methods · Natural Language Processing Techniques
Methods: Balanced Selection · LLaMA
