LLM-based feature generation from text for interpretable machine learning
Vojtěch Balek, Lukáš Sýkora, Vilém Sklenák, Tomáš Kliegr

TL;DR
This paper shows that large language models can generate a small set of interpretable features from scientific text. These features are effective for predicting research impact and perform comparably to state-of-the-art embeddings.
Contribution
The study introduces a method for extracting a compact, interpretable feature set from text using LLMs, enabling transparent machine learning models across diverse scientific domains.
Findings
LLM-generated features are semantically meaningful and correlate with research impact.
Models trained on these features achieve accuracy similar to SciBERT while using fewer, interpretable features.
The approach generalizes well across different scientific disciplines.
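The correlation check mentioned in the findings can be illustrated with a minimal sketch. It assumes a Pearson correlation with a permutation test for significance; the statistical test actually used in the paper is not stated here, and the function names, the feature, and the toy data below are hypothetical, not taken from CORD-19 or M17+.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the observed correlation.

    The target is shuffled repeatedly; the p-value is the share of
    shuffles whose |r| is at least as large as the observed |r|.
    """
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Made-up toy data: a hypothetical binary LLM-generated feature
# and a hypothetical binary "highly cited" label for 12 articles.
feature = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
target = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

r = pearson(feature, target)
p = permutation_p_value(feature, target)
print(f"r = {r:.3f}, permutation p-value = {p:.3f}")
```

A feature would be kept as "semantically meaningful" under this sketch only if its p-value falls below a chosen threshold; with many candidate features, a multiple-testing correction would likely be needed as well.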
Abstract
Existing text representations such as embeddings and bag-of-words are not suitable for rule learning due to their high dimensionality and absent or questionable feature-level interpretability. This article explores whether large language models (LLMs) could address this by extracting a small number of interpretable features from text. We demonstrate this process on two datasets (CORD-19 and M17+) containing several thousand scientific articles from multiple disciplines and a target being a proxy for research impact. An evaluation based on testing for the statistically significant correlation with research impact has shown that Llama 2-generated features are semantically meaningful. We consequently used these generated features in text classification to predict the binary target variable representing the citation rate for the CORD-19 dataset and the ordinal 5-class target representing an…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Explainable Artificial Intelligence (XAI)
Methods: Sparse Evolutionary Training · LLaMA
