A Two-Sample Test of Text Generation Similarity
Jingbin Xu, Chen Qian, Meimei Liu, Feng Guo

TL;DR
This paper introduces a statistical test of whether two groups of documents are generated by the same probabilistic mechanism. It compares the groups' entropy, estimated with neural network language models; the test statistic is asymptotically normal, and a multiple data-splitting strategy is used to improve power.
Contribution
It presents a novel two-sample test for text similarity based on entropy estimation, combining neural network models with a data-splitting inference framework.
Findings
Maintains nominal Type I error rate.
Offers greater power than existing methods.
Validated through simulations and real data.
Abstract
The surge in digitized text data requires reliable inferential methods on observed textual patterns. This article proposes a novel two-sample text test for comparing similarity between two groups of documents. The hypothesis is whether the probabilistic mapping generating the textual data is identical across two groups of documents. The proposed test aims to assess text similarity by comparing the entropy of the documents. Entropy is estimated using neural network-based language models. The test statistic is derived from an estimation-and-inference framework, where the entropy is first approximated using an estimation set, followed by inference on the remaining data set. We showed theoretically that under mild conditions, the test statistic asymptotically follows a normal distribution. A multiple data-splitting strategy is proposed to enhance test power, which combines p-values into a…
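The estimation-and-inference idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the add-one smoothed unigram model is a stand-in for a neural language model, fitting it on the pooled estimation documents is an assumption, and the names `fit_unigram`, `doc_entropy`, and `split_entropy_test` are hypothetical. Only a single split is shown, because the rule for combining p-values across multiple splits is cut off in the excerpt above.

```python
import math
import random
import statistics
from collections import Counter


def fit_unigram(docs):
    """Fit an add-one smoothed unigram model (a stand-in for a neural LM)."""
    counts = Counter(tok for doc in docs for tok in doc)
    # One extra slot in the denominator reserves mass for unseen tokens.
    total = sum(counts.values()) + len(counts) + 1

    def log_prob(tok):
        return math.log((counts.get(tok, 0) + 1) / total)

    return log_prob


def doc_entropy(doc, log_prob):
    """Average negative log-likelihood per token of one non-empty document."""
    return -sum(log_prob(tok) for tok in doc) / len(doc)


def split_entropy_test(group_x, group_y, est_frac=0.5, seed=0):
    """Single-split z-test on mean per-document entropy; returns (z, p)."""
    rng = random.Random(seed)
    x, y = list(group_x), list(group_y)
    rng.shuffle(x)
    rng.shuffle(y)
    nx, ny = int(len(x) * est_frac), int(len(y) * est_frac)

    # Estimation step: fit the model on pooled estimation documents
    # (pooling is an assumption of this sketch).
    log_prob = fit_unigram(x[:nx] + y[:ny])

    # Inference step: score only the held-out documents.
    hx = [doc_entropy(d, log_prob) for d in x[nx:]]
    hy = [doc_entropy(d, log_prob) for d in y[ny:]]

    # Two-sample z statistic; assumes the entropy scores are not all identical.
    se = math.sqrt(statistics.variance(hx) / len(hx)
                   + statistics.variance(hy) / len(hy))
    z = (statistics.mean(hx) - statistics.mean(hy)) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return z, p
```

Each document is a list of tokens, so `split_entropy_test(docs_a, docs_b)` returns the z statistic and a two-sided p-value based on the normal approximation. Keeping the estimation and inference documents disjoint is what lets the held-out entropy scores be treated as independent of the fitted model.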
Taxonomy
Topics: Topic Modeling · Authorship Attribution and Profiling · Handwritten Text Recognition Techniques
