Replicating and Extending "Because Their Treebanks Leak": Graph Isomorphism, Covariants, and Parser Performance
Mark Anderson, Anders Søgaard, Carlos Gómez Rodríguez

TL;DR
This paper replicates and extends prior research on how graph isomorphism between training and test trees in treebanks affects parser performance. The correlation observed in the wild disappears once covariants are controlled for, but a strong correlation appears in a controlled experiment where covariants are held fixed.
Contribution
It identifies methodological issues in previous studies and demonstrates the importance of controlled experiments for understanding parser performance factors.
Findings
Only a small subset of sentences varies in performance with respect to graph isomorphism
The correlation between parser performance and isomorphism observed in the wild disappears when controlling for covariants
A strong correlation is observed in a controlled experiment where covariants are kept fixed
Abstract
Søgaard (2020) obtained results suggesting the fraction of trees occurring in the test data isomorphic to trees in the training set accounts for a non-trivial variation in parser performance. Similar to other statistical analyses in NLP, the results were based on evaluating linear regressions. However, the study had methodological issues and was undertaken using a small sample size leading to unreliable results. We present a replication study in which we also bin sentences by length and find that only a small subset of sentences vary in performance with respect to graph isomorphism. Further, the correlation observed between parser performance and graph isomorphism in the wild disappears when controlling for covariants. However, in a controlled experiment, where covariants are kept fixed, we do observe a strong correlation. We suggest that conclusions drawn from statistical analyses…
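To make the quantity discussed in the abstract concrete, the following is a minimal sketch of how the fraction of test trees isomorphic to some training tree might be computed. It rests on several assumptions that do not come from the paper: each dependency tree is given as a list of 1-based head indices (as in the HEAD column of CoNLL-U, with 0 marking the root), isomorphism is taken to mean unlabeled rooted-tree isomorphism that ignores words, dependency labels, and word order, and the function names `canonical_form` and `isomorphic_fraction` are hypothetical. The authors' exact definition and implementation may differ.

```python
from typing import Sequence


def canonical_form(heads: Sequence[int]) -> str:
    """Return a canonical string for an unlabeled rooted tree.

    heads[i] is the 1-based head of token i+1; 0 marks the root
    (assumed input format, not taken from the paper).
    Two trees receive the same string exactly when they are isomorphic
    as unlabeled rooted trees (AHU-style encoding).
    """
    # Node 0 is an artificial root that dominates the sentence root.
    children = {node: [] for node in range(len(heads) + 1)}
    for dependent, head in enumerate(heads, start=1):
        children[head].append(dependent)

    def encode(node: int) -> str:
        # Sorting the child encodings makes the result order-independent.
        return "(" + "".join(sorted(encode(c) for c in children[node])) + ")"

    return encode(0)


def isomorphic_fraction(
    train: Sequence[Sequence[int]], test: Sequence[Sequence[int]]
) -> float:
    """Fraction of test trees isomorphic to at least one training tree."""
    if not test:
        return 0.0
    seen = {canonical_form(heads) for heads in train}
    hits = sum(canonical_form(heads) in seen for heads in test)
    return hits / len(test)


# Toy data (invented for illustration only).
train_trees = [[2, 0, 2], [0, 1, 2]]
test_trees = [[0, 1, 1], [2, 3, 0], [0, 1]]
fraction = isomorphic_fraction(train_trees, test_trees)
```

In this toy example, the first two test trees share a shape with a training tree and the third does not. Hashing canonical strings keeps the comparison linear in the number of trees instead of testing every train-test pair, which is a design choice of this sketch only.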
