Categorical distance correlation under general encodings and its application to high-dimensional feature screening
Qingyang Zhang

TL;DR
This paper extends distance correlation to categorical data with general encodings, improving feature screening in high-dimensional settings by leveraging category spacing information and providing theoretical guarantees.
Contribution
It introduces a novel approach to categorical distance correlation using general encodings, with theoretical properties and practical applications for high-dimensional feature screening.
Findings
Encoding methods significantly affect correlation performance
Proposed estimates have well-defined limiting distributions
Method demonstrates effective high-dimensional screening in simulations
Abstract
In this paper, we extend distance correlation to categorical data with general encodings, such as one-hot encoding for nominal variables and semicircle encoding for ordinal variables. Unlike existing methods, our approach leverages the spacing information between categories, which enhances the performance of distance correlation. We give two estimates, the maximum likelihood estimate and a bias-corrected estimate, together with their limiting distributions under the null and alternative hypotheses. Furthermore, we establish the sure screening property for high-dimensional categorical data under mild conditions. We conduct a simulation study to compare the performance of different encodings, and illustrate their practical utility using the 2018 General Social Survey data.
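To make the idea in the abstract concrete, here is a minimal sketch of distance correlation computed on encoded categorical variables, followed by a simple rank-and-keep screening step. It uses the classical sample (V-statistic) distance correlation, not the paper's maximum likelihood or bias-corrected estimates. The function names (`one_hot`, `semicircle`, `distance_correlation`, `screen`) are hypothetical, and the semicircle encoding shown, which places K ordered categories at equally spaced angles on a half circle, is an assumption about what that encoding looks like; the paper's exact definition may differ.

```python
import numpy as np


def one_hot(labels, n_categories):
    """Encode integer labels 0..K-1 as rows of the K x K identity matrix."""
    return np.eye(n_categories)[np.asarray(labels)]


def semicircle(labels, n_categories):
    """ASSUMED encoding: category k sits at angle pi*k/(K-1) on the unit half circle.

    Requires n_categories >= 2. Adjacent categories end up closer together
    than distant ones, which is the spacing information the abstract refers to.
    """
    theta = np.pi * np.asarray(labels) / (n_categories - 1)
    return np.column_stack([np.cos(theta), np.sin(theta)])


def _centered_distances(z):
    """Pairwise Euclidean distance matrix, double-centered."""
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_correlation(x, y):
    """Classical sample distance correlation between two encoded samples (n x p, n x q)."""
    a, b = _centered_distances(x), _centered_distances(y)
    dcov2 = (a * b).mean()                              # squared distance covariance
    denom = np.sqrt((a * a).mean() * (b * b).mean())    # product of distance standard deviations, squared scale
    if denom == 0:
        return 0.0                                      # a constant variable carries no dependence
    return float(np.sqrt(max(dcov2, 0.0) / denom))


def screen(encoded_features, encoded_response, top_d):
    """Rank features by distance correlation with the response and keep the top_d indices."""
    scores = np.array([distance_correlation(f, encoded_response) for f in encoded_features])
    return np.argsort(-scores)[:top_d], scores


# Toy usage: one feature identical to the response, one empirically independent of it.
y = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])
x_dependent = y.copy()
x_independent = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
features = [one_hot(x_dependent, 3), one_hot(x_independent, 3)]
top, scores = screen(features, semicircle(y, 3), top_d=1)
```

In this toy setup the response is treated as ordinal (semicircle) and the features as nominal (one-hot). Swapping the encoders changes the pairwise distances between categories and therefore the screening scores, which is the kind of encoding comparison the simulation study describes.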
Taxonomy
Topics: Statistical Methods and Inference · Bayesian Methods and Mixture Models · Statistical Methods and Bayesian Inference
