LLM-based feature generation from text for interpretable machine learning

Vojtěch Balek; Lukáš Sýkora; Vilém Sklenák; Tomáš Kliegr

arXiv:2409.07132·cs.LG·October 2, 2025

TL;DR

This paper demonstrates that large language models can generate a small set of interpretable features from scientific text, which are effective for research impact prediction and comparable to state-of-the-art embeddings.

Contribution

The study introduces a method for extracting a compact, interpretable feature set from text using LLMs, enabling transparent machine learning models across diverse scientific domains.
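As a rough illustration of the idea, the sketch below turns one text into a small dictionary of named features by putting one question per feature to a model. Everything in it is an assumption made for illustration: the feature names and questions, the prompt wording, the 0-to-1 answer scale, and the `ask` callable standing in for an LLM call are hypothetical and are not taken from the paper or its repository.

```python
import re
from typing import Callable, Dict

# Hypothetical feature definitions (name -> question); not the paper's feature set.
FEATURE_QUESTIONS: Dict[str, str] = {
    "rigor": "How methodologically rigorous is the study described?",
    "novelty": "How novel is the contribution described?",
}


def build_prompt(question: str, text: str) -> str:
    """Combine one feature question with the text to be scored."""
    return (
        f"{question}\n"
        "Answer with a single number between 0 and 1.\n\n"
        f"Text: {text}"
    )


def parse_score(reply: str) -> float:
    """Take the first number in the reply and clamp it to [0, 1]."""
    match = re.search(r"-?\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"no number found in reply: {reply!r}")
    return min(1.0, max(0.0, float(match.group())))


def extract_features(text: str, ask: Callable[[str], str]) -> Dict[str, float]:
    """Ask one question per feature; `ask` is any prompt -> reply function."""
    return {
        name: parse_score(ask(build_prompt(question, text)))
        for name, question in FEATURE_QUESTIONS.items()
    }


if __name__ == "__main__":
    # A canned reply stands in for a real model so the sketch runs offline.
    print(extract_features("An abstract goes here.", lambda prompt: "Score: 0.8."))
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client, and each resulting column stays readable because it maps back to one named question.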

Findings

01. LLM-generated features are semantically meaningful and correlate with research impact.

02. Models trained on these features achieve similar accuracy to SciBERT with fewer, interpretable features.

03. The approach generalizes well across different scientific disciplines.

Abstract

Existing text representations such as embeddings and bag-of-words are not suitable for rule learning because of their high dimensionality and absent or questionable feature-level interpretability. This article explores whether large language models (LLMs) could address this by extracting a small number of interpretable features from text. We demonstrate this process on two datasets (CORD-19 and M17+) containing several thousand scientific articles from multiple disciplines, with a target that serves as a proxy for research impact. An evaluation that tested for a statistically significant correlation with research impact showed that Llama 2-generated features are semantically meaningful. We consequently used these generated features in text classification to predict the binary target variable representing the citation rate for the CORD-19 dataset and the ordinal 5-class target representing an…
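The correlation check mentioned in the abstract can be sketched roughly as follows. The excerpt does not say which statistical test the authors used, so this example swaps in a two-sided permutation test on the Pearson correlation as a stand-in; the function names, the number of permutations, and the toy data are all assumptions.

```python
import random
from statistics import mean
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(
    x: Sequence[float],
    y: Sequence[float],
    n_permutations: int = 2000,
    seed: int = 0,
) -> float:
    """Two-sided p-value: share of target shuffles with |r| at least as large."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point ties.
        if abs(pearson(x, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Toy data only: one feature column and a binary impact label.
    feature = list(range(12))
    target = [0] * 6 + [1] * 6
    print(pearson(feature, target), permutation_p_value(feature, target))
```

A feature would be kept as "meaningful" under this sketch when its p-value falls below a chosen threshold; when many features are tested at once, a multiple-testing correction would normally be applied as well.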

Code & Models

Repositories

vojtech-balek/llm-features
Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Explainable Artificial Intelligence (XAI)

Methods: Sparse Evolutionary Training · LLaMA