propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale

Maximilian Idahl; Benedikt Droste; Björn Plüster; Jan Philipp Harries

arXiv:2602.12414 · cs.CL · February 20, 2026

TL;DR

propella-1 is a family of small multilingual LLMs that produce structured, multi-property annotations for documents. The annotations support flexible data curation and analysis at scale and are more interpretable and detailed than a single quality score.

Contribution

The paper presents propella-1, a family of multilingual LLMs (0.6B, 1.7B, and 4B parameters) that annotate documents across 18 properties, and releases propella-annotations, a dataset of over three billion document annotations.

Findings

01

The 4B propella-1 model achieves higher agreement with a frontier commercial reference annotator than much larger general-purpose models.

02

Over three billion document annotations released for diverse datasets.

03

The annotations reveal significant differences in quality and content across datasets.

Abstract

Since FineWeb-Edu, data curation for LLM pretraining has predominantly relied on single scalar quality scores produced by small classifiers. A single score conflates multiple quality dimensions, prevents flexible filtering, and offers no interpretability. We introduce propella-1, a family of small multilingual LLMs (0.6B, 1.7B, 4B parameters) that annotate text documents across 18 properties organized into six categories: core content, classification, quality and value, audience and purpose, safety and compliance, and geographic relevance. The models support 57 languages and produce structured JSON annotations conforming to a predefined schema. Evaluated against a frontier commercial LLM as a reference annotator, the 4B model achieves higher agreement than much larger general-purpose models. We release propella-annotations, a dataset of over three billion document annotations covering…
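
To illustrate why structured multi-property annotations allow more flexible filtering than a single scalar score, the following is a minimal Python sketch. It is not the propella-1 schema: the property names (`content_quality`, `educational_value`, `safety`), their allowed values, and the `doc_id` field are hypothetical placeholders chosen for illustration, and the helper functions are assumptions rather than part of any released tooling.

```python
import json

# Hypothetical properties and value sets for illustration only;
# the actual propella-1 schema defines 18 properties and is not reproduced here.
ALLOWED = {
    "content_quality": {"low", "medium", "high"},
    "educational_value": {"low", "medium", "high"},
    "safety": {"safe", "unsafe"},
}


def parse_annotation(raw: str) -> dict:
    """Parse one JSON annotation and check it against the illustrative schema."""
    ann = json.loads(raw)
    for key, allowed in ALLOWED.items():
        if key not in ann:
            raise ValueError(f"missing property: {key}")
        if ann[key] not in allowed:
            raise ValueError(f"invalid value for {key}: {ann[key]!r}")
    return ann


def keep(ann: dict) -> bool:
    """Example filter that combines several properties instead of one score."""
    return (
        ann["safety"] == "safe"
        and ann["content_quality"] in {"medium", "high"}
        and ann["educational_value"] == "high"
    )


# Made-up annotations standing in for model output.
raw_annotations = [
    '{"doc_id": "a", "content_quality": "high", "educational_value": "high", "safety": "safe"}',
    '{"doc_id": "b", "content_quality": "high", "educational_value": "low", "safety": "safe"}',
    '{"doc_id": "c", "content_quality": "low", "educational_value": "high", "safety": "safe"}',
]

kept = [a for a in map(parse_annotation, raw_annotations) if keep(a)]
print([a["doc_id"] for a in kept])  # prints ['a']
```

Because each property is stored separately, the `keep` predicate can be changed for a different curation goal (for example, relaxing the educational-value requirement) without re-annotating any documents, which a single conflated score does not allow.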

Code & Models

Models

ellamind/propella-1-4b
model · 2.1k downloads · 14 likes

Datasets

openeurollm/propella-annotations
dataset · 2.5k downloads

Taxonomy

Topics: Library Science and Information Systems · Biomedical Text Mining and Ontologies · Data Quality and Management