The PLLuM Instruction Corpus

Piotr Pęzik; Filip Żarnecki; Konrad Kaczyński; Anna Cichosz; Zuzanna Deckert; Monika Garnys; Izabela Grabarczyk; Wojciech Janowski; Sylwia Karasińska; Aleksandra Kujawiak; Piotr Misztela; Maria Szymańska; Karolina Walkusz; Igor Siek; Maciej Chrabąszcz; Anna Kołos; Agnieszka Karlińska; Karolina Seweryn; Aleksandra Krasnodębska; Paula Betscher; Zofia Cieślińska; Katarzyna Kowol; Artur Wilczek; Maciej Trzciński; Katarzyna Dziewulska; Roman Roszko; Tomasz Bernaś; Jurgita Vaičenonienė; Danuta Roszko; Paweł Levchuk; Paweł Kowalski; Irena Prawdzic-Jankowska; Marek Kozłowski; Sławomir Dadas; Rafał Poświata; Alina Wróblewska; Katarzyna Krasnowska-Kieraś; Maciej Ogrodniczuk; Michał Rudolf; Piotr Rybak; Karolina Saputa; Joanna Wołoszyn; Marcin Oleksy; Bartłomiej Koptyra; Teddy Ferdinan; Stanisław Woźniak; Maciej Piasecki; Paweł Walkowiak; Konrad Wojtasik; Arkadiusz Janz; Przemysław Kazienko; Julia Moska; Jan Kocoń

arXiv:2511.17161 · cs.CL · November 24, 2025

TL;DR

This paper introduces the PLLuM instruction dataset for fine-tuning Polish language models, discusses instruction types, and compares human-authored versus synthetic data, providing a resource for future LLM development.

Contribution

The paper presents a typology of the instructions used in PLLuM, analyzes the implications of dataset composition, and releases the first subset of the instruction corpus for Polish LLMs.

Findings

01. Insights into instruction dataset composition and implications for LLM adaptation

02. Introduction of the PLLuMIC subset as a resource for model training

03. Observations on the differences between human and synthetic instructions

Abstract

This paper describes the instruction dataset used to fine-tune a set of transformer-based large language models (LLMs) developed in the PLLuM (Polish Large Language Model) project. We present a functional typology of the organic, converted, and synthetic instructions used in PLLuM and share some observations about the implications of using human-authored versus synthetic instruction datasets in the linguistic adaptation of base LLMs. Additionally, we release the first representative subset of the PLLuM instruction corpus (PLLuMIC), which we believe to be useful in guiding and planning the development of similar datasets for other LLMs.

Taxonomy

Topics: Text Readability and Simplification · Natural Language Processing Techniques · Topic Modeling