Replicating and Extending "Because Their Treebanks Leak": Graph Isomorphism, Covariants, and Parser Performance

Mark Anderson; Anders Søgaard; Carlos Gómez Rodríguez

arXiv:2106.00352 · cs.CL · June 3, 2021

TL;DR

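This paper replicates and extends Søgaard (2020), which suggested that the fraction of test trees isomorphic to training trees explains a non-trivial share of the variation in parser performance. The correlation observed in the wild disappears when controlling for covariants, but a strong correlation does appear in a controlled experiment where covariants are kept fixed.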

Contribution

It identifies methodological issues and a small sample size in the original study, and shows that controlled experiments are needed to separate the effect of graph isomorphism from that of covariants when explaining parser performance.

Findings

01. When sentences are binned by length, only a small subset varies in performance with respect to graph isomorphism.

02. The correlation between parser performance and graph isomorphism observed in the wild disappears when controlling for covariants.

03. A strong correlation is observed in a controlled experiment where covariants are kept fixed.

Abstract

Søgaard (2020) obtained results suggesting the fraction of trees occurring in the test data isomorphic to trees in the training set accounts for a non-trivial variation in parser performance. Similar to other statistical analyses in NLP, the results were based on evaluating linear regressions. However, the study had methodological issues and was undertaken using a small sample size leading to unreliable results. We present a replication study in which we also bin sentences by length and find that only a small subset of sentences vary in performance with respect to graph isomorphism. Further, the correlation observed between parser performance and graph isomorphism in the wild disappears when controlling for covariants. However, in a controlled experiment, where covariants are kept fixed, we do observe a strong correlation. We suggest that conclusions drawn from statistical analyses…
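
The central quantity in the abstract, the fraction of test trees that are isomorphic to some training tree, can be illustrated with a short sketch. This is not the authors' code. It assumes each dependency tree is given as a list of 1-based head indices (0 for the root, as in CoNLL-U) and that trees are compared as rooted, unordered, unlabeled trees, which may differ from the exact definition used in the paper. The function names `canonical_form` and `isomorphic_fraction` are hypothetical.

```python
from typing import Dict, List


def canonical_form(heads: List[int]) -> str:
    """Return a canonical string for a rooted, unordered, unlabeled tree.

    heads[i] is the 1-based head of token i + 1; 0 marks the root.
    Two trees get the same string exactly when they are isomorphic
    under this (assumed) notion of isomorphism.
    """
    # Node 0 is an artificial root that dominates the sentence root.
    children: Dict[int, List[int]] = {i: [] for i in range(len(heads) + 1)}
    for dependent, head in enumerate(heads, start=1):
        children[head].append(dependent)

    def encode(node: int) -> str:
        # Sorting the child encodings removes word-order information.
        return "(" + "".join(sorted(encode(c) for c in children[node])) + ")"

    return encode(0)


def isomorphic_fraction(train: List[List[int]], test: List[List[int]]) -> float:
    """Fraction of test trees whose shape also occurs in the training set."""
    if not test:
        return 0.0
    seen = {canonical_form(heads) for heads in train}
    hits = sum(canonical_form(heads) in seen for heads in test)
    return hits / len(test)


# Toy data: one training tree, two test trees.
train_trees = [[2, 0, 2]]             # token 2 is the root with two dependents
test_trees = [[0, 1, 1], [0, 1, 2]]   # same shape, then a three-token chain
leak = isomorphic_fraction(train_trees, test_trees)
```

In this toy example the first test tree has the same shape as the training tree and the second does not. Canonical strings are used so that each tree is encoded once and matched by set lookup, rather than compared pairwise against every training tree.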
