The PLLuM Instruction Corpus
Piotr Pęzik, Filip Żarnecki, Konrad Kaczyński, Anna Cichosz, Zuzanna Deckert, Monika Garnys, Izabela Grabarczyk, Wojciech Janowski, Sylwia Karasińska, Aleksandra Kujawiak, Piotr Misztela, Maria Szymańska, Karolina Walkusz, Igor Siek, Maciej Chrabąszcz

TL;DR
This paper introduces the PLLuM instruction dataset for fine-tuning Polish language models, discusses instruction types, and compares human-authored versus synthetic data, providing a resource for future LLM development.
Contribution
It presents a typology of instructions used in PLLuM, analyzes dataset implications, and releases the first subset of the instruction corpus for Polish LLMs.
Findings
Insights into instruction dataset composition and implications for LLM adaptation
Introduction of the PLLuMIC subset as a resource for model training
Observations on the differences between human and synthetic instructions
Abstract
This paper describes the instruction dataset used to fine-tune a set of transformer-based large language models (LLMs) developed in the PLLuM (Polish Large Language Model) project. We present a functional typology of the organic, converted, and synthetic instructions used in PLLuM and share some observations about the implications of using human-authored versus synthetic instruction datasets in the linguistic adaptation of base LLMs. Additionally, we release the first representative subset of the PLLuM instruction corpus (PLLuMIC), which we believe to be useful in guiding and planning the development of similar datasets for other LLMs.
Taxonomy
Topics: Text Readability and Simplification · Natural Language Processing Techniques · Topic Modeling
