On the data requirements of probing

Zining Zhu; Jixuan Wang; Bai Li; Frank Rudzicz

arXiv:2202.12801 · cs.CL · February 28, 2022

TL;DR

This paper introduces a quantitative method for estimating how many samples a probing dataset for neural language models needs, so that probing results are reliable without incurring unnecessary data collection costs.

Contribution

It presents a novel statistical framework for estimating how many additional data samples are needed, after a small pilot study, to distinguish two probing configurations of a neural NLP model.

Findings

01 Across several case studies, the estimated dataset sizes are verified to have sufficient statistical power.

02 The proposed approach improves the reliability of probing experiments.

03 The framework aids the systematic construction of probing datasets.

Abstract

As large and powerful neural language models are developed, researchers have been increasingly interested in developing diagnostic tools to probe them. There are many papers with conclusions of the form "observation X is found in model Y", using their own datasets with varying sizes. Larger probing datasets bring more reliability, but are also expensive to collect. There is yet to be a quantitative method for estimating reasonable probing dataset sizes. We tackle this omission in the context of comparing two probing configurations: after we have collected a small dataset from a pilot study, how many additional data samples are sufficient to distinguish two different configurations? We present a novel method to estimate the required number of data samples in such experiments and, across several case studies, we verify that our estimations have sufficient statistical power. Our framework…
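To make the question in the abstract concrete, here is a minimal sketch of a textbook power analysis that turns pilot-study scores into a sample-size estimate. It is not the paper's method: it assumes a two-sided two-sample z-test approximation on per-example scores, and the function names, the example pilot scores, and the defaults (alpha = 0.05, power = 0.8) are all hypothetical choices made for illustration.

```python
import math
from statistics import NormalDist, mean, stdev


def samples_for_effect_size(d, alpha=0.05, power=0.8):
    """Per-configuration sample size for a standardized effect size d.

    Uses the normal approximation n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2.
    This is a generic textbook formula, not the estimator from the paper.
    """
    if d <= 0:
        raise ValueError("effect size must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value of the two-sided test
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


def required_samples(pilot_a, pilot_b, alpha=0.05, power=0.8):
    """Estimate the per-configuration sample size from two pilot score lists."""
    n_a, n_b = len(pilot_a), len(pilot_b)
    var_a, var_b = stdev(pilot_a) ** 2, stdev(pilot_b) ** 2
    # Pooled standard deviation of the two pilot samples.
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    if pooled_sd == 0:
        raise ValueError("pilot scores have zero variance")
    # Standardized difference between the two configurations (Cohen's d).
    d = abs(mean(pilot_a) - mean(pilot_b)) / pooled_sd
    return samples_for_effect_size(d, alpha, power)


if __name__ == "__main__":
    # Hypothetical per-example correctness (1 = correct) from a small pilot study.
    config_a = [1, 1, 0, 1, 1, 0, 1, 1]
    config_b = [0, 1, 0, 0, 1, 0, 0, 1]
    print(required_samples(config_a, config_b))
```

Under these assumptions, a medium effect (d = 0.5) calls for 63 samples per configuration, while a small effect (d = 0.2) calls for 393, which illustrates why closely matched probing configurations need much larger datasets to tell apart.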

Code & Models

Repositories

spoclab-ca/probing_dataset
PyTorch · Official

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Machine Learning and Data Classification