Interpretable Discriminative Text Representations via Agreement and Label Disentanglement
Tong Wang, Yiqing Xu, Leo Yang Yang

TL;DR
This paper introduces LLM-assisted Feature Discovery (LFD), a method for building interpretable text representations whose features are conceptually clear and disentangled from the label, validated through human and LLM audits across multiple datasets.
Contribution
The paper proposes an operational criterion for interpretability in text representations and develops LFD, an iterative method whose features are clearer and less entangled with the label, while predictive performance stays comparable to strong baselines.
Findings
LFD matches strong baseline performance in text classification tasks.
LFD features show higher human and LLM agreement than baseline concepts.
LFD features leak less label information according to human audits.
Abstract
Interpretable text representations should expose coordinates that are not only predictive, but also meaningful enough for independent auditors to apply. Existing discriminative representations often use anonymous embedding directions, while concept-bottleneck and LLM-assisted methods attach natural-language names to features without ensuring that those definitions are reproducible or distinct from the target label. We propose an operational criterion for interpretable discriminative text representations: each coordinate should satisfy conceptual clarity, measured by chance-adjusted agreement between independent annotators applying the feature definition, and label disentanglement, meaning the feature should not merely paraphrase the prediction target. We instantiate this criterion in LLM-assisted Feature Discovery (LFD), an iterative method that proposes lexical and semantic features…
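The abstract measures conceptual clarity as chance-adjusted agreement between independent annotators applying a feature definition. The visible text does not name the statistic, so the sketch below assumes Cohen's kappa for two annotators as one common choice; the function name `cohens_kappa` and the example labels are hypothetical and not taken from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Chance-adjusted agreement between two annotators (Cohen's kappa).

    Assumption: this is an illustrative statistic, not necessarily the one
    used in the paper. `a` and `b` hold each annotator's judgement of whether
    a feature applies to each text, in the same order.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("annotations must be non-empty and of equal length")

    n = len(a)
    # Observed agreement: fraction of items on which the annotators match.
    observed = sum(x == y for x, y in zip(a, b)) / n

    # Expected agreement if each annotator labelled independently,
    # using their own marginal label frequencies.
    counts_a, counts_b = Counter(a), Counter(b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label everywhere; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Hypothetical audit: two annotators mark whether one feature applies to 8 texts.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0]
annotator_2 = [1, 1, 0, 0, 1, 0, 0, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 6 of 8 items (0.75), while chance agreement from their marginals is 0.5, so kappa is 0.5. Raw agreement alone would overstate how reproducibly the feature definition can be applied, which is why a chance-adjusted measure is the natural fit for the clarity criterion.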
