Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement
Vivi Nastase, Chunyang Jiang, Giuseppe Samo, Paola Merlo

TL;DR
This study investigates how well multilingual pretrained language models encode syntactic information, specifically subject-verb agreement, across languages, revealing language-specific differences in their syntactic representations.
Contribution
The paper introduces a new synthetic dataset and multiple-choice task, Blackbird Language Matrices (BLMs), to evaluate cross-linguistic syntactic understanding in multilingual models, highlighting the limits of their shared syntactic representations.
Findings
Multilingual models show language-specific syntactic differences.
Models struggle to generalize syntactic structures across languages.
A two-step architecture effectively detects syntactic patterns.
Abstract
In this paper, our goal is to investigate to what degree multilingual pretrained language models capture cross-linguistically valid abstract linguistic representations. We take the approach of developing curated synthetic data on a large scale, with specific properties, and using them to study sentence representations built using pretrained language models. We use a new multiple-choice task and datasets, Blackbird Language Matrices (BLMs), to focus on a specific grammatical structural phenomenon -- subject-verb agreement across a variety of sentence structures -- in several languages. Finding a solution to this task requires a system detecting complex linguistic patterns and paradigms in text representations. Using a two-level architecture that solves the problem in two steps -- detect syntactic objects and their properties in individual sentences, and find patterns across an input…
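To make the multiple-choice setup concrete, here is a toy sketch of how a candidate answer could be selected from sentence embeddings. It is not the authors' two-level architecture: the averaging step, the cosine scoring rule, and the names `cosine`, `mean_vector`, and `choose_answer` are all assumptions made for illustration, and the vectors stand in for embeddings that would come from a pretrained language model.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def choose_answer(context: List[Vector], candidates: List[Vector]) -> int:
    """Return the index of the candidate closest to the context summary.

    Assumption: the context sequence is summarized by a plain average,
    a crude stand-in for a learned pattern detector.
    """
    summary = mean_vector(context)
    scores = [cosine(summary, candidate) for candidate in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)
```

Calling `choose_answer(context_embeddings, candidate_embeddings)` returns the position of the selected answer. A learned model would replace the averaging step with a component trained to detect the pattern across the context sentences.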
Taxonomy
Topics: Neurobiology of Language and Bilingualism · Natural Language Processing Techniques · Linguistics and terminology studies
