A Multivariate Bernoulli-Based Sampling Method for Multi-Label Data with Application to Meta-Research
Simon Chung, Colby J. Vorland, Donna L. Maney, Andrew W. Brown

TL;DR
This paper introduces a novel multivariate Bernoulli sampling method for multi-label datasets that accounts for label dependencies, improving the representation of minority categories in samples.
Contribution
It presents a new sampling algorithm, based on the multivariate Bernoulli distribution, that accounts for label dependencies and adjusts the sampling weights accordingly.
Findings
Produced more balanced samples with better minority category representation.
Effectively preserved label frequency order and reduced frequency disparities.
Applied successfully to biomedical research article data.
Abstract
Datasets may contain observations with multiple labels. When the labels are not mutually exclusive and vary greatly in frequency, it is challenging to obtain a sample that includes enough observations with scarcer labels to support inferences about those labels while deviating from the population frequencies in a known manner. In this paper, we treat a multivariate Bernoulli distribution as the underlying distribution of a multi-label problem. We present a novel sampling algorithm that takes label dependencies into account. It uses observed label frequencies to estimate the parameters of the multivariate Bernoulli distribution and to calculate a weight for each label combination. This approach ensures that the weighted sampling acquires the characteristics of the target distribution while accounting for label dependencies. We applied this approach to a sample of research articles from Web of…
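The abstract describes estimating the joint distribution of label combinations and deriving a sampling weight for each combination. The sketch below illustrates that general idea only; it is not the authors' algorithm. The names `combination_weights` and `weighted_sample`, the tempering exponent `alpha`, and the choice of a tempered target distribution are all assumptions made for this illustration, since the summary does not state how the paper defines its target or its weights.

```python
from collections import Counter

import numpy as np


def combination_weights(Y, alpha=0.5):
    """Return one sampling probability per row of a binary label matrix Y (n x K).

    ASSUMPTION: the target distribution is the empirical joint pmf raised to
    the power `alpha` and renormalised. This is an illustrative choice, not
    the target used in the paper.
    """
    Y = np.asarray(Y, dtype=int)
    keys = [tuple(row) for row in Y]  # one key per label combination
    n = len(keys)

    # Empirical joint pmf over the observed label combinations. Working with
    # whole combinations (not per-label marginals) keeps label dependencies.
    p_obs = {k: c / n for k, c in Counter(keys).items()}

    # Hypothetical target: alpha < 1 flattens the pmf but keeps its ordering.
    z = sum(p ** alpha for p in p_obs.values())
    p_tgt = {k: p ** alpha / z for k, p in p_obs.items()}

    # Weight per combination = target probability / observed probability.
    w = {k: p_tgt[k] / p_obs[k] for k in p_obs}

    row_w = np.array([w[k] for k in keys])
    return row_w / row_w.sum()


def weighted_sample(Y, size, alpha=0.5, seed=0):
    """Draw `size` distinct row indices using the combination weights."""
    rng = np.random.default_rng(seed)
    probs = combination_weights(Y, alpha)
    return rng.choice(len(Y), size=size, replace=False, p=probs)
```

With 90 rows labelled `[1, 0]` and 10 rows labelled `[0, 1]`, the default `alpha=0.5` gives the minority combination a total selection probability of 0.25 instead of 0.10 on each draw, while the majority combination remains the more likely one. Because the weights are known, estimates from the sample can be reweighted back to the population frequencies.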
Taxonomy
Topics: Text and Document Classification Technologies · Reliability and Agreement in Measurement · Meta-analysis and systematic reviews
