Evaluating Syntactic Properties of Seq2seq Output with a Broad Coverage HPSG: A Case Study on Machine Translation
Johnny Tian-Zheng Wei, Khiem Pham, Brian Dillon, Brendan O'Connor

TL;DR
This study tests whether the output of a seq2seq machine translation model is syntactically well-formed by parsing it with the English Resource Grammar (ERG), a broad-coverage HPSG grammar of English. Over 93% of the translations are parseable, but the model has trouble with rarer syntactic rules.
Contribution
It introduces a method for evaluating seq2seq translation output against a broad-coverage, linguistically precise grammar (the ERG), measuring both parseability and the grammatical constructions that occur in the output.
Findings
Over 93% of the model's translations are parseable by the ERG.
The model has trouble learning the distribution of rarer syntactic rules.
Several grammatical constructions differentiate the model's translations from the reference translations.
Abstract
Sequence to sequence (seq2seq) models are often employed in settings where the target output is natural language. However, the syntactic properties of the language generated from these models are not well understood. We explore whether such output belongs to a formal and realistic grammar, by employing the English Resource Grammar (ERG), a broad coverage, linguistically precise HPSG-based grammar of English. From a French to English parallel corpus, we analyze the parseability and grammatical constructions occurring in output from a seq2seq translation model. Over 93% of the model translations are parseable, suggesting that it learns to generate conforming to a grammar. The model has trouble learning the distribution of rarer syntactic rules, and we pinpoint several constructions that differentiate translations between the references and our model.
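The two measurements the abstract describes, parseability and a comparison of rule usage between references and model output, can be illustrated with a minimal sketch. This is not the authors' pipeline. It assumes each sentence has already been parsed by some external tool and is represented either as `None` (no parse) or as a list of rule names from its derivation. The function names, the rule names, and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def parseability(parses):
    """Fraction of sentences that received a parse (entry is not None)."""
    if not parses:
        return 0.0
    return sum(p is not None for p in parses) / len(parses)

def rule_frequencies(parses):
    """Relative frequency of each rule name across all parsed sentences."""
    counts = Counter(rule for p in parses if p is not None for rule in p)
    total = sum(counts.values())
    return {rule: n / total for rule, n in counts.items()} if total else {}

def frequency_gaps(ref_parses, model_parses):
    """Rules sorted by how much their relative frequency differs
    between model output and references (model minus reference)."""
    ref = rule_frequencies(ref_parses)
    model = rule_frequencies(model_parses)
    gaps = {r: model.get(r, 0.0) - ref.get(r, 0.0) for r in set(ref) | set(model)}
    return sorted(gaps.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Toy, made-up data: rule names are placeholders, not actual ERG rule names.
reference_parses = [
    ["subj_head", "head_comp"],
    ["subj_head", "rare_rule"],
    ["head_comp"],
]
model_parses = [
    ["subj_head", "head_comp"],
    None,  # a translation the grammar could not parse
    ["subj_head", "head_comp"],
]

model_rate = parseability(model_parses)
gaps = frequency_gaps(reference_parses, model_parses)
```

In this toy setup a rule that appears in the references but never in the model output surfaces at the top of `gaps` with a negative value, which is the kind of signal one would look for when checking whether rarer rules are under-produced.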
Taxonomy
Topics: Natural Language Processing Techniques · Genomics and Phylogenetic Studies · Topic Modeling
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence
