The Syntactic Acceptability Dataset (Preview): A Resource for Machine Learning and Linguistic Analysis of English
Tom S Juzek

TL;DR
This paper introduces a preview of a publicly accessible dataset of 1,000 English sequences annotated for both grammaticality and acceptability, intended to support research in syntax and computational linguistics.
Contribution
It provides the largest publicly accessible dataset of its kind, pairing grammaticality labels extracted from the syntactic literature with crowdsourced native speaker acceptability judgments, which enables new analyses and model evaluations.
Findings
Grammaticality and acceptability judgments agree in 83% of cases.
Models predict acceptability better than grammaticality.
In-betweenness (judgments that are neither clearly acceptable nor clearly unacceptable) is common.
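The findings above rest on comparing a binary grammaticality label with a graded acceptability rating. The following is a minimal sketch of how such an agreement rate and an in-betweenness share could be computed. It is not the author's method: the field names, the rescaling of ratings to [0, 1], the 0.5 threshold, and the 0.25 to 0.75 middle band are all assumptions made for illustration, and the toy items are invented rather than taken from the dataset.

```python
from dataclasses import dataclass


@dataclass
class Item:
    sentence: str
    grammatical: bool     # label from the literature (hypothetical field name)
    acceptability: float  # mean rating, assumed to be rescaled to [0, 1]


def agreement_rate(items, threshold=0.5):
    """Share of items whose binarized rating matches the grammaticality label."""
    if not items:
        raise ValueError("need at least one item")
    matches = sum((it.acceptability >= threshold) == it.grammatical for it in items)
    return matches / len(items)


def in_between_share(items, low=0.25, high=0.75):
    """Share of items whose rating falls strictly inside the assumed middle band."""
    if not items:
        raise ValueError("need at least one item")
    return sum(low < it.acceptability < high for it in items) / len(items)


# Toy items invented for illustration; ratings are NOT from the dataset.
toy = [
    Item("The cat sat on the mat.", True, 0.95),
    Item("Cat the mat on sat the.", False, 0.05),
    Item("Who do you wonder whether left?", False, 0.60),
    Item("The horse raced past the barn fell.", True, 0.30),
]

rate = agreement_rate(toy)
middle = in_between_share(toy)
```

Both measures depend on the chosen cutoffs, so any reported percentage should be read together with the thresholds that produced it.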
Abstract
We present a preview of the Syntactic Acceptability Dataset, a resource being designed for both syntax and computational linguistics research. In its current form, the dataset comprises 1,000 English sequences from the syntactic discourse: half from textbooks and half from the journal Linguistic Inquiry, the latter to ensure that contemporary discourse is represented. Each entry is labeled with its grammatical status ("well-formedness" according to syntactic formalisms), extracted from the literature, and its acceptability status ("intuitive goodness" as determined by native speakers), obtained through crowdsourcing under the highest experimental standards. Even in its preliminary form, this dataset is the largest publicly accessible one of its kind. We also offer preliminary analyses addressing three debates in linguistics and computational linguistics: We observe that…
