Exploring Out-of-Distribution Generalization in Text Classifiers Trained on Tobacco-3482 and RVL-CDIP
Stefan Larson, Navtej Singh, Saarthak Maheshwari, Shanti Stewart, Uma Krishnaswamy

TL;DR
This paper investigates how well text classifiers trained on the Tobacco-3482 and RVL-CDIP datasets handle out-of-distribution documents. Models trained on the larger RVL-CDIP show smaller performance drops than models trained on Tobacco-3482, which suggests that larger training sets improve out-of-distribution robustness.
Contribution
The study introduces new out-of-distribution evaluation datasets, built with Tobacco-3482 and RVL-CDIP as a starting point, and uses them to compare the generalization performance of models trained on the smaller and the larger dataset.
Findings
Models trained on Tobacco-3482 perform poorly out-of-distribution.
Models trained on RVL-CDIP show smaller performance drops.
The comparison suggests that larger training datasets improve out-of-distribution robustness.
Abstract
To be robust enough for widespread adoption, document analysis systems involving machine learning models must be able to respond correctly to inputs that fall outside of the data distribution that was used to generate the data on which the models were trained. This paper explores the ability of text classifiers trained on standard document classification datasets to generalize to out-of-distribution documents at inference time. We take the Tobacco-3482 and RVL-CDIP datasets as a starting point and generate new out-of-distribution evaluation datasets in order to analyze the generalization performance of models trained on these standard datasets. We find that models trained on the smaller Tobacco-3482 dataset perform poorly on our new out-of-distribution data, while text classification models trained on the larger RVL-CDIP exhibit smaller performance drops.
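The "performance drop" discussed in the abstract can be read as the gap between accuracy on in-distribution test data and accuracy on out-of-distribution data. The snippet below is a minimal sketch of that comparison, not the authors' evaluation code: the function names (`accuracy`, `ood_drop`), the keyword-based `toy_predict` classifier, and the four example documents are all hypothetical and exist only to make the arithmetic concrete.

```python
from typing import Callable, Dict, Sequence, Tuple

Dataset = Tuple[Sequence[str], Sequence[str]]  # (texts, labels)


def accuracy(predict: Callable[[str], str], data: Dataset) -> float:
    """Fraction of documents whose predicted label matches the gold label."""
    texts, labels = data
    if len(texts) == 0 or len(texts) != len(labels):
        raise ValueError("texts and labels must be non-empty and equal length")
    correct = sum(predict(t) == y for t, y in zip(texts, labels))
    return correct / len(texts)


def ood_drop(predict: Callable[[str], str],
             in_dist: Dataset, ood: Dataset) -> Dict[str, float]:
    """Report in-distribution accuracy, OOD accuracy, and their difference."""
    in_acc = accuracy(predict, in_dist)
    ood_acc = accuracy(predict, ood)
    return {"in_dist_acc": in_acc, "ood_acc": ood_acc, "drop": in_acc - ood_acc}


# Hypothetical stand-in for a trained text classifier: it has "learned"
# a single surface cue that happens to work on in-distribution data.
def toy_predict(text: str) -> str:
    return "invoice" if "total due" in text.lower() else "letter"


# Hypothetical toy data, not taken from Tobacco-3482 or RVL-CDIP.
in_dist = (["Total due: $40", "Dear Sir, thank you"], ["invoice", "letter"])
ood = (["Amount payable: $40", "Dear Madam, regards"], ["invoice", "letter"])

result = ood_drop(toy_predict, in_dist, ood)
print(result)
```

In this toy setup the classifier is perfect on the in-distribution pair but misses the out-of-distribution invoice because the phrasing changed, which is the kind of gap the paper measures at scale with real models and datasets.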
