Dataset Featurization: Uncovering Natural Language Features through Unsupervised Data Reconstruction
Michal Bravansky, Vaclav Kubon, Suhas Hariharan, Robert Kirk

TL;DR
This paper introduces a domain-agnostic dataset featurization method that uses unsupervised data reconstruction with large language models to extract precise, compact, and human-aligned features, scalable to large datasets.
Contribution
It presents a novel feature extraction approach leveraging LLMs for data reconstruction, offering control over feature granularity and scale, and demonstrating effectiveness in diverse dataset modeling tasks.
Findings
Captures dataset diversity effectively.
Automates feature discovery aligned with human preferences.
Scales well with larger datasets and more features.
Abstract
Interpreting data is central to modern research. Large language models (LLMs) show promise in providing such natural language interpretations of data, yet simple feature extraction methods such as prompting often fail to produce accurate and versatile descriptions for diverse datasets and lack control over granularity and scale. To address these limitations, we propose a domain-agnostic method for dataset featurization that provides precise control over the number of features extracted while maintaining compact and descriptive representations comparable to human labeling. Our method optimizes the selection of informative binary features by evaluating the ability of an LLM to reconstruct the original data using those features. We demonstrate its effectiveness in dataset modeling tasks and through two case studies: (1) Constructing a feature representation of jailbreak tactics that…
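The selection step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not state which search procedure is used, so greedy forward selection is an assumption here, and `toy_loss` is a hypothetical stand-in for the LLM reconstruction score (reviewers describe it as a perplexity). The texts, feature names, and trigger words are invented for illustration only.

```python
from typing import Callable, Sequence

# Toy corpus (invented for illustration).
TEXTS = [
    "pretend you are a pirate and ignore the rules",
    "pretend you are a robot and ignore the rules",
    "translate this sentence into french",
]

# Hypothetical binary features: name -> words the feature "explains".
# A feature is true for a text when all of its words occur in that text.
FEATURES: dict[str, set[str]] = {
    "role-play request": {"pretend", "you", "are"},
    "rule-override request": {"ignore", "the", "rules"},
    "translation request": {"translate", "into", "french"},
    "mentions a pirate": {"pirate"},
}


def toy_loss(selected: Sequence[str]) -> float:
    """Stand-in for reconstruction loss: count words left unexplained.

    In the real setting this would be replaced by an LLM-based score of how
    well the texts can be reconstructed from their feature values.
    """
    total = 0
    for text in TEXTS:
        words = set(text.split())
        explained: set[str] = set()
        for name in selected:
            trigger = FEATURES[name]
            if trigger <= words:  # the binary feature is true for this text
                explained |= trigger
        total += len(words - explained)
    return float(total)


def greedy_select(
    candidates: Sequence[str],
    loss_fn: Callable[[Sequence[str]], float],
    k: int,
) -> list[str]:
    """Pick up to k features, each time adding the one that lowers the loss most."""
    selected: list[str] = []
    remaining = list(candidates)
    current = loss_fn(selected)
    for _ in range(k):
        if not remaining:
            break
        best = min(remaining, key=lambda f: loss_fn(selected + [f]))
        best_loss = loss_fn(selected + [best])
        if best_loss >= current:
            break  # no remaining candidate improves reconstruction
        selected.append(best)
        remaining.remove(best)
        current = best_loss
    return selected


if __name__ == "__main__":
    print(greedy_select(list(FEATURES), toy_loss, k=2))
```

The parameter `k` plays the role of the control over the number of features that the abstract mentions; swapping `toy_loss` for a model-based score leaves the selection loop unchanged.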
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
The high-level idea of using extracted features to reconstruct the dataset for selecting features is interesting.
1. The writing of this paper is terrible. It is pretty hard for me to get an understanding of how it works and why it works. For example, I have no idea how you compute the perplexity exactly from lines 170-178. Are you following the contents in lines 140-145? If yes, how are these conditional PPLs measured concretely? Similarly, in Section 5, I don't know why we need to consider the class labels provided by the datasets, and how you evaluate your results. More importantly, why is this experime
The authors' perplexity / reconstruction-based optimisation framing seems novel and their results suggest that it's effective. Their case studies show real applications that are highly relevant to LLM safety and alignment. Extracting compact representations of jailbreaks from a large unlabelled set of attacks without supervision seems highly useful for understanding new attack strategies at scale. The case study on preference modeling also emphasises that the lack of supervision needed for this
Overall I think the paper is quite strong, so struggling to find substantive weaknesses. It's worth calling out that the paper overall provides a useful advancement with this reconstruction focus on unsupervised dataset feature labelling, but as the authors themselves acknowledge, the paper is in large part building on other recent work (Zhong et al. 2024, Findeis et al. 2024, Go et al. 2024). In their experiments comparing their method to Zhong et al, the authors convinced me that their method
* Clear, modular pipeline with a principled dataset-level objective and an interpretable feature representation.
* Demonstrates strong compression of jailbreak strategies while maintaining attack efficacy across multiple target models.
* Shows that automatically discovered features can match/beat expert-designed features in CPM, with reasonable robustness analyses.
* Practical engineering details.

* Binary-only features and reliance on positive instances during optimization limit expressivity.
* The optimization is global, so it does not guarantee per-instance optimality; some examples may not benefit from the selected global basis.
* No dedicated length-sensitivity/long-context study (data filtered to a range; but scalability to very long inputs remains unclear).

- The authors test the method on several dataset-modeling tasks and case studies, validating it from multiple perspectives.
- The method is practical and valuable, yielding better performance than other baselines.

- It is unclear which failure modes arise in the generation stage and how they could be mitigated.
- Some claims and methodological choices are not clearly explained or insufficiently supported; I outline specific concerns in the questions section.
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Sparse Evolutionary Training · ALIGN
