Comparing the writing style of real and artificial papers
Diego R. Amancio

TL;DR
This paper presents a network-based methodology to distinguish real scientific papers from artificially generated ones with high accuracy, addressing the issue of scientific fraud and paper authenticity.
Contribution
It introduces a novel approach using complex network features to effectively identify fake papers generated by software like SCIGen.
Findings
Achieved at least 89% accuracy in classification
Identified key network features like accessibility and betweenness
Demonstrated the potential of combining network analysis with traditional methods
Abstract
Recent years have witnessed increasing competition in science. While promoting the quality of research in many cases, intense competition among scientists can also trigger unethical scientific behaviors. To increase the total number of published papers, some authors even resort to software tools that are able to produce grammatical, but meaningless scientific manuscripts. Because automatically generated papers can be mistaken for real papers, it becomes of paramount importance to develop means to identify these scientific frauds. In this paper, I devise a methodology to distinguish real manuscripts from those generated with SCIGen, an automatic paper generator. Upon modeling texts as complex networks (CN), it was possible to discriminate real from fake papers with at least 89% accuracy. A systematic analysis of feature relevance revealed that the accessibility and…
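The abstract's idea of modeling a text as a complex network and measuring betweenness can be illustrated with a minimal sketch. This is not the author's implementation. The summary does not say how the network is built, so the sketch assumes a simple word-adjacency network in which consecutive words are linked. The function names `build_adjacency_network` and `betweenness` are hypothetical, and preprocessing, accessibility, and the classification step are omitted.

```python
from collections import defaultdict, deque


def build_adjacency_network(text):
    """Assumed model: link every pair of consecutive words (undirected, unweighted)."""
    words = text.lower().split()
    graph = defaultdict(set)
    for a, b in zip(words, words[1:]):
        if a != b:  # skip self-loops from immediate repetitions
            graph[a].add(b)
            graph[b].add(a)
    for w in words:
        graph[w]  # make sure every word appears as a node
    return dict(graph)


def betweenness(graph):
    """Unnormalized betweenness via Brandes' algorithm on an unweighted graph."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        stack = []
        preds = {v: [] for v in graph}   # predecessors on shortest paths from s
        sigma = {v: 0 for v in graph}    # number of shortest paths from s
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:  # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair of endpoints was counted twice in an undirected graph.
    return {v: c / 2 for v, c in score.items()}


if __name__ == "__main__":
    network = build_adjacency_network("networks model texts and texts model ideas")
    for word, value in sorted(betweenness(network).items()):
        print(word, value)
```

In a pipeline of the kind the abstract describes, per-word values like these would presumably be summarized (for example by their mean over the text) and passed to a classifier; that step is an assumption and is not shown here.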
Taxonomy
Topics: Topic Modeling · Advanced Text Analysis Techniques · Biomedical Text Mining and Ontologies
