SynSciPass: detecting appropriate uses of scientific text generation
Domenic Rosati

TL;DR
This paper introduces SynSciPass, a nuanced dataset and framework for detecting machine-generated scientific text, addressing limitations of binary classification and improving robustness across domains.
Contribution
It develops a dataset with technology-specific labels for machine-generated text and demonstrates improved detection robustness and technology identification over existing models.
Findings
A model trained on SynSciPass is more robust to domain shift.
Current datasets are insufficient for real-world detection scenarios.
Models can identify the type of text generation technology used.
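The difference between a binary scheme and technology-specific labels can be sketched in a few lines. This is a minimal illustration only: the label names (`HUMAN`, `GENERATED`, `TRANSLATED`, `PARAPHRASED`), the `APPROPRIATE` policy set, and both helper functions are assumptions made for the example, not the paper's actual label set or code.

```python
from enum import Enum


class Source(Enum):
    """Hypothetical technology-specific labels (assumed, not from the paper)."""
    HUMAN = "human"
    GENERATED = "generated"
    TRANSLATED = "translated"
    PARAPHRASED = "paraphrased"


# Hypothetical publisher policy: which sources count as appropriate use.
APPROPRIATE = {Source.HUMAN, Source.TRANSLATED}


def binary_label(source: Source) -> str:
    """Collapse the fine-grained label into human vs. machine."""
    return "human" if source is Source.HUMAN else "machine"


def needs_review(source: Source) -> bool:
    """Flag a passage only when its source falls outside the policy."""
    return source not in APPROPRIATE


if __name__ == "__main__":
    for src in Source:
        print(f"{src.value:12} binary={binary_label(src):8} review={needs_review(src)}")
```

Under the binary view a machine-translated passage is simply "machine", while the finer label lets a policy treat translation as acceptable and reserve review for other technologies.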
Abstract
Approaches to machine-generated text detection tend to focus on binary classification of human- versus machine-written text. In the scientific domain, where publishers might use these models to examine manuscripts under submission, misclassification has the potential to cause harm to authors. Additionally, authors may appropriately use text generation models, for example assistive technologies like translation tools. In this setting, a binary classification scheme might flag appropriate uses of assistive text generation technology as simply machine generated, which is a cause for concern. In our work, we simulate this scenario by presenting a state-of-the-art detector trained on DAGPap22 with machine-translated passages from Scielo and find that the model performs at random. Given this finding, we develop a framework for dataset development that provides a nuanced…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Authorship Attribution and Profiling
