Estimating the Influence of Sequentially Correlated Literary Properties in Textual Classification: A Data-Centric Hypothesis-Testing Approach
Gideon Yoffe, Nachum Dershowitz, Ariel Vishne, Barak Sober

TL;DR
This paper presents a data-centric hypothesis-testing framework to quantify how sequentially correlated literary properties influence textual classification, revealing that many models are confounded by thematic continuity rather than stylistic features.
Contribution
The paper introduces a novel statistical method to disentangle sequential correlations from non-sequential features in text classification, improving interpretability and reliability.
Findings
Supervised and neural models are more prone to false positives due to thematic confounding.
Traditional features used in unsupervised settings often yield high true-positive rates with fewer false positives.
Controlling for sequential correlation enhances classification reliability in authorship and forensic analysis.
Abstract
We introduce a data-centric hypothesis-testing framework to quantify the influence of sequentially correlated literary properties--such as thematic continuity--on textual classification tasks. Our method models label sequences as stochastic processes and uses an empirical autocovariance matrix to generate surrogate labelings that preserve sequential dependencies. This enables statistical testing to determine whether classification outcomes are primarily driven by thematic structure or by non-sequential features like authorial style. Applying this framework across a diverse corpus of English prose, we compare traditional (word n-grams and character k-mers) and neural (contrastively trained) embeddings in both supervised and unsupervised classification settings. Crucially, our method identifies when classifications are confounded by sequentially correlated similarity, revealing that…
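The surrogate-labeling idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`empirical_autocovariance`, `surrogate_labelings`, `surrogate_p_value`), the `max_lag` truncation, the eigenvalue clipping, and the choice to threshold a correlated Gaussian sequence are all assumptions made for illustration. Thresholding preserves the class balance exactly but reproduces the autocovariance of the binary labels only approximately.

```python
import numpy as np


def empirical_autocovariance(labels, max_lag):
    """Biased sample autocovariance of a label sequence for lags 0..max_lag."""
    x = np.asarray(labels, dtype=float)
    x = x - x.mean()
    n = len(x)
    return np.array([np.dot(x[: n - k], x[k:]) / n for k in range(max_lag + 1)])


def surrogate_labelings(labels, n_surrogates, max_lag, rng):
    """Draw binary surrogate labelings with similar sequential dependence.

    Assumed construction: sample Gaussian sequences whose covariance is the
    Toeplitz matrix built from the empirical autocovariance, then set the
    k largest entries of each sequence to 1, where k is the observed number
    of positive labels.
    """
    labels = np.asarray(labels)
    n = len(labels)
    acov = empirical_autocovariance(labels, max_lag)

    # Toeplitz covariance: entry (i, j) depends only on |i - j|,
    # and is zero beyond max_lag.
    by_lag = np.zeros(n)
    by_lag[: max_lag + 1] = acov
    lag_index = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    cov = by_lag[lag_index]

    # Truncating the lags can break positive semi-definiteness,
    # so clip the eigenvalues before taking a matrix square root.
    eigvals, eigvecs = np.linalg.eigh(cov)
    root = eigvecs * np.sqrt(np.clip(eigvals, 1e-10, None))

    gaussian = rng.standard_normal((n_surrogates, n)) @ root.T

    # Threshold each row so that it has exactly k ones.
    k = int(labels.sum())
    order = np.argsort(gaussian, axis=1)
    surrogates = np.zeros((n_surrogates, n), dtype=int)
    np.put_along_axis(surrogates, order[:, n - k:], 1, axis=1)
    return surrogates


def surrogate_p_value(observed_score, labels, score_fn,
                      n_surrogates=200, max_lag=10, seed=0):
    """Fraction of surrogates scoring at least as well as the observed labeling.

    score_fn is a hypothetical callable mapping a labeling to a quality
    score, for example a classifier's cross-validated accuracy.
    """
    rng = np.random.default_rng(seed)
    surrogates = surrogate_labelings(labels, n_surrogates, max_lag, rng)
    null_scores = np.array([score_fn(s) for s in surrogates])
    # Add-one correction keeps the estimate strictly positive.
    return (1 + np.sum(null_scores >= observed_score)) / (1 + n_surrogates)
```

Under these assumptions, a small p-value suggests the observed classification performs better than would be expected from sequential correlation alone, while a large one indicates that labelings sharing only the sequential structure do about as well.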
Taxonomy
Topics: Text and Document Classification Technologies · Sentiment Analysis and Opinion Mining · Authorship Attribution and Profiling
