propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale
Maximilian Idahl, Benedikt Droste, Björn Plüster, Jan Philipp Harries

TL;DR
propella-1 is a family of small multilingual LLMs that produce structured, multi-property annotations for documents. This enables more nuanced data curation and analysis at scale than single-score methods, with greater interpretability and detail.
Contribution
The paper presents propella-1, a family of multilingual LLMs that generate detailed, multi-property document annotations, and releases propella-annotations, a dataset of over three billion document annotations, to support data quality assessment.
Findings
The 4B propella-1 model achieves higher agreement with a frontier commercial LLM reference annotator than much larger general-purpose models.
Over three billion document annotations released for diverse datasets.
Reveals significant differences in dataset quality and content.
Abstract
Since FineWeb-Edu, data curation for LLM pretraining has predominantly relied on single scalar quality scores produced by small classifiers. A single score conflates multiple quality dimensions, prevents flexible filtering, and offers no interpretability. We introduce propella-1, a family of small multilingual LLMs (0.6B, 1.7B, 4B parameters) that annotate text documents across 18 properties organized into six categories: core content, classification, quality and value, audience and purpose, safety and compliance, and geographic relevance. The models support 57 languages and produce structured JSON annotations conforming to a predefined schema. Evaluated against a frontier commercial LLM as a reference annotator, the 4B model achieves higher agreement than much larger general-purpose models. We release propella-annotations, a dataset of over three billion document annotations covering…
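The flexible filtering that the abstract contrasts with a single scalar score can be illustrated with a minimal Python sketch. The field names (`content_quality`, `educational_value`, `safety`), their value ranges, and the sample records are hypothetical placeholders chosen for illustration; they are not the actual propella-1 schema, which is not reproduced here.

```python
import json

# Hypothetical annotation records, one JSON object per document.
# The property names and value ranges are assumptions for illustration,
# not the actual propella-1 schema.
RAW_ANNOTATIONS = [
    '{"doc_id": "a", "content_quality": 4, "educational_value": 2, "safety": "safe"}',
    '{"doc_id": "b", "content_quality": 2, "educational_value": 5, "safety": "safe"}',
    '{"doc_id": "c", "content_quality": 5, "educational_value": 5, "safety": "unsafe"}',
]


def parse_annotations(lines):
    """Parse one JSON annotation object per input line."""
    return [json.loads(line) for line in lines]


def filter_documents(annotations, min_quality=None, min_educational=None,
                     require_safe=True):
    """Return the ids of documents passing every condition that is set.

    Each property is checked independently, so the same annotations can
    serve different curation goals without re-scoring the corpus.
    """
    kept = []
    for ann in annotations:
        if min_quality is not None and ann["content_quality"] < min_quality:
            continue
        if min_educational is not None and ann["educational_value"] < min_educational:
            continue
        if require_safe and ann["safety"] != "safe":
            continue
        kept.append(ann["doc_id"])
    return kept


annotations = parse_annotations(RAW_ANNOTATIONS)
high_quality_ids = filter_documents(annotations, min_quality=4)
educational_ids = filter_documents(annotations, min_educational=5)
```

Under these assumptions, the two calls select different subsets of the same annotated corpus, which a single combined score could not express without retraining or re-scoring.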
Taxonomy
Topics: Library Science and Information Systems · Biomedical Text Mining and Ontologies · Data Quality and Management
