Global Syntactic Variation in Seven Languages: Towards a Computational Dialectology
Jonathan Dunn

TL;DR
This paper introduces a computational approach to analyzing global syntactic variation across seven major languages, using Construction Grammar with web-crawled and social media data to improve regional dialect prediction and the understanding of language change.
Contribution
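It presents a scalable methodology that models seven languages with the same methods, using computational construction grammar and large-scale web data. This removes three constraints on earlier dialectology: a fixed and incomplete set of variants, a pre-selected area of interest, and the study of a single language in isolation.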
Findings
Models built on Construction Grammar features predict region of origin better than models built on simpler features.
Construction Grammar provides more robust and generalizable dialect models.
Global-scale analysis reveals patterns of syntactic variation across languages.
Abstract
The goal of this paper is to provide a complete representation of regional linguistic variation on a global scale. To this end, the paper focuses on removing three constraints that have previously limited work within dialectology/dialectometry. First, rather than assuming a fixed and incomplete set of variants, we use Computational Construction Grammar to provide a replicable and falsifiable set of syntactic features. Second, rather than assuming a specific area of interest, we use global language mapping based on web-crawled and social media datasets to determine the selection of national varieties. Third, rather than looking at a single language in isolation, we model seven major languages together using the same methods: Arabic, English, French, German, Portuguese, Russian, and Spanish. Results show that models for each language are able to robustly predict the region-of-origin of…
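The region-of-origin prediction described in the abstract can be illustrated with a small sketch. This is not the paper's implementation. It assumes each text sample has already been reduced to a list of construction identifiers, and it uses a nearest-centroid classifier as a simple stand-in because the abstract does not name a classifier. The construction labels (C1, C2, C3), the region labels (region_A, region_B), and all function names are hypothetical.

```python
from collections import Counter
import math

def featurize(construction_ids, inventory):
    """Relative frequency of each construction in one sample."""
    counts = Counter(construction_ids)
    total = sum(counts.values()) or 1  # avoid division by zero on empty samples
    return [counts[c] / total for c in inventory]

def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def train(samples, inventory):
    """samples: list of (region, construction_ids). Returns region -> centroid."""
    by_region = {}
    for region, ids in samples:
        by_region.setdefault(region, []).append(featurize(ids, inventory))
    return {region: centroid(vecs) for region, vecs in by_region.items()}

def predict(model, construction_ids, inventory):
    """Return the region whose centroid is closest in Euclidean distance."""
    vec = featurize(construction_ids, inventory)
    return min(model, key=lambda region: math.dist(vec, model[region]))

# Toy data (invented for illustration, not drawn from the paper).
inventory = ["C1", "C2", "C3"]
samples = [
    ("region_A", ["C1", "C1", "C2"]),
    ("region_A", ["C1", "C2", "C1", "C1"]),
    ("region_B", ["C3", "C3", "C2"]),
    ("region_B", ["C3", "C2", "C3", "C3"]),
]
model = train(samples, inventory)
print(predict(model, ["C1", "C1", "C1", "C2"], inventory))
print(predict(model, ["C3", "C3"], inventory))
```

The point of the sketch is only that a sample's construction frequencies form a feature vector, and that a classifier over those vectors assigns a regional variety. Any real feature set and classifier would need to be taken from the paper itself.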
