Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages
Kaja Dobrovoljc

TL;DR
This study introduces a new, scalable method for analyzing syntactic differences between speech and writing across languages using dependency treebanks. It reveals modality-specific syntactic preferences and limited overlap between spoken and written structure inventories.
Contribution
It presents a fully inductive, treebank-driven approach for comparing syntactic structures in speech and writing across languages, and uses it to identify modality-specific syntactic patterns.
Findings
Spoken corpora have fewer and less diverse syntactic structures than written ones.
Limited overlap exists between spoken and written syntactic inventories.
Speech-specific structures are linked to interactivity and economy of expression.
Abstract
This paper presents a novel treebank-driven approach to comparing syntactic structures in speech and writing using dependency-parsed corpora. Adopting a fully inductive, bottom-up method, we define syntactic structures as delexicalized dependency (sub)trees and extract them from spoken and written Universal Dependencies (UD) treebanks in two syntactically distinct languages, English and Slovenian. For each corpus, we analyze the size, diversity, and distribution of syntactic inventories, their overlap across modalities, and the structures most characteristic of speech. Results show that, across both languages, spoken corpora contain fewer and less diverse syntactic structures than their written counterparts, with consistent cross-linguistic preferences for certain structural types across modalities. Strikingly, the overlap between spoken and written syntactic inventories is very…
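To make the idea of a delexicalized dependency (sub)tree concrete, here is a minimal Python sketch of how such structures might be extracted from CoNLL-U text and compared across two corpora. It is an illustration under stated assumptions, not the paper's implementation: the subtree definition used here (a head plus its immediate dependents, in surface order, labeled by relation and UPOS tag), the function names, the Jaccard overlap measure, and the toy sentences are all hypothetical choices made for this example.

```python
from collections import Counter


def read_conllu(text):
    """Yield sentences as lists of (id, upos, head, deprel) tuples."""
    sent = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if sent:
                yield sent
                sent = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if not cols[0].isdigit():
            continue
        # CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        sent.append((int(cols[0]), cols[3], int(cols[6]), cols[7]))
    if sent:
        yield sent


def delexicalized_subtrees(sent):
    """Yield one structure per head that has dependents.

    Assumption: a structure is the head plus its immediate dependents in
    surface order, with word forms dropped (only deprel and UPOS are kept).
    """
    upos = {tid: pos for tid, pos, _, _ in sent}
    children = {}
    for tid, pos, head, rel in sent:
        if head != 0:
            children.setdefault(head, []).append((tid, rel, pos))
    for head, deps in children.items():
        nodes = sorted(deps + [(head, "HEAD", upos[head])])
        yield " ".join(f"{rel}:{pos}" for _, rel, pos in nodes)


def inventory(text):
    """Count every extracted structure in a CoNLL-U string."""
    counts = Counter()
    for sent in read_conllu(text):
        counts.update(delexicalized_subtrees(sent))
    return counts


def jaccard(a, b):
    """Share of structure types found in both inventories (an assumed measure)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def row(tid, form, upos, head, rel):
    """Build one 10-column CoNLL-U line for the toy data below."""
    return "\t".join([str(tid), form, "_", upos, "_", "_", str(head), rel, "_", "_"])


# Invented toy sentences, not data from the paper.
SPOKEN = "\n".join([row(1, "I", "PRON", 2, "nsubj"), row(2, "see", "VERB", 0, "root")])
WRITTEN = "\n".join([
    row(1, "She", "PRON", 2, "nsubj"),
    row(2, "reads", "VERB", 0, "root"),
    row(3, "books", "NOUN", 2, "obj"),
    "",
    row(1, "He", "PRON", 2, "nsubj"),
    row(2, "sleeps", "VERB", 0, "root"),
])

spoken_inv = inventory(SPOKEN)
written_inv = inventory(WRITTEN)
print(len(spoken_inv), len(written_inv), jaccard(spoken_inv, written_inv))
```

On real treebanks the same functions would be applied to the full text of each spoken and written CoNLL-U file. Inventory size is then the number of distinct keys in each counter, and the overlap is computed over structure types rather than tokens.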
