PLLuM: A Family of Polish Large Language Models
Jan Kocoń, Maciej Piasecki, Arkadiusz Janz, Teddy Ferdinan, Łukasz Radliński, Bartłomiej Koptyra, Marcin Oleksy, Stanisław Woźniak, Paweł Walkowiak, Konrad Wojtasik, Julia Moska, Tomasz Naskręt, Bartosz Walkowiak, Mateusz Gniewkowski, Kamil Szyc

TL;DR
PLLuM is presented as the largest open-source family of Polish language models, built on a new 140-billion-token Polish corpus and a Responsible AI framework, with the aim of improving AI support for Polish and promoting open research.
Contribution
The paper presents a new family of Polish language models together with a large pre-training corpus, safety measures, and alignment techniques, addressing the shortage of non-English AI resources.
Findings
- The models perform well on downstream tasks in public administration.
- The paper introduces a comprehensive safety and governance framework.
- The models are released openly to support research and sovereignty in Poland.
Abstract
Large Language Models (LLMs) play a central role in modern artificial intelligence, yet their development has been primarily focused on English, resulting in limited support for other languages. We present PLLuM (Polish Large Language Model), the largest open-source family of foundation models tailored specifically for the Polish language. Developed by a consortium of major Polish research institutions, PLLuM addresses the need for high-quality, transparent, and culturally relevant language models beyond the English-centric commercial landscape. We describe the development process, including the construction of a new 140-billion-token Polish text corpus for pre-training, a 77k custom instructions dataset, and a 100k preference optimization dataset. A key component is a Responsible AI framework that incorporates strict data governance and a hybrid module for output correction and safety…
Code & Models
- 🤗 CYFRAGOVPL/Llama-PLLuM-70B-base (model) · 3 dl
- 🤗 CYFRAGOVPL/Llama-PLLuM-70B-instruct (model) · 11 dl · ♡ 7
- 🤗 CYFRAGOVPL/Llama-PLLuM-70B-chat (model) · 206 dl · ♡ 2
- 🤗 CYFRAGOVPL/PLLuM-8x7B-nc-base (model) · 3 dl · ♡ 1
- 🤗 CYFRAGOVPL/PLLuM-8x7B-nc-instruct (model) · 4 dl · ♡ 4
- 🤗 CYFRAGOVPL/PLLuM-8x7B-nc-chat (model) · 15 dl · ♡ 4
- 🤗 CYFRAGOVPL/PLLuM-8x7B-base (model) · 11 dl
- 🤗 CYFRAGOVPL/PLLuM-8x7B-instruct (model) · 28 dl · ♡ 2
- 🤗 CYFRAGOVPL/PLLuM-8x7B-chat (model) · 39 dl · ♡ 15
- 🤗 CYFRAGOVPL/PLLuM-12B-nc-base (model) · 16 dl · ♡ 1
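
The repository names above follow a visible pattern: a family name, an optional "nc" tag, and a variant (base, instruct, or chat). The following is a minimal sketch of how one might split such an identifier and, optionally, load a checkpoint. Several points are assumptions rather than facts from the page: that the trailing "model" in the listing is a type label and not part of the repository id, that "nc" marks a non-commercial license, and that the checkpoints load through the standard Hugging Face `transformers` auto classes. The names `ModelRef`, `parse_model_id`, and `load_model` are hypothetical helpers introduced for illustration.

```python
from dataclasses import dataclass

# Repository ids as listed above, assuming the trailing "model" is a page label.
MODEL_IDS = [
    "CYFRAGOVPL/Llama-PLLuM-70B-base",
    "CYFRAGOVPL/Llama-PLLuM-70B-instruct",
    "CYFRAGOVPL/Llama-PLLuM-70B-chat",
    "CYFRAGOVPL/PLLuM-8x7B-nc-base",
    "CYFRAGOVPL/PLLuM-8x7B-nc-instruct",
    "CYFRAGOVPL/PLLuM-8x7B-nc-chat",
    "CYFRAGOVPL/PLLuM-8x7B-base",
    "CYFRAGOVPL/PLLuM-8x7B-instruct",
    "CYFRAGOVPL/PLLuM-8x7B-chat",
    "CYFRAGOVPL/PLLuM-12B-nc-base",
]

VARIANTS = {"base", "instruct", "chat"}


@dataclass(frozen=True)
class ModelRef:
    """Hypothetical container for the parts of a listed repository id."""
    repo_id: str
    family: str
    variant: str
    nc: bool  # assumption: "nc" marks a non-commercial release


def parse_model_id(repo_id: str) -> ModelRef:
    """Split 'ORG/FAMILY[-nc]-VARIANT' into its parts."""
    _org, name = repo_id.split("/", 1)
    parts = name.split("-")
    variant = parts[-1]
    if variant not in VARIANTS:
        raise ValueError(f"unrecognized variant in {repo_id!r}")
    stem = parts[:-1]
    nc = bool(stem) and stem[-1] == "nc"
    if nc:
        stem = stem[:-1]
    return ModelRef(repo_id, "-".join(stem), variant, nc)


def load_model(repo_id: str):
    """Load tokenizer and weights; needs `transformers` and ample memory.

    Not called in this sketch, since the checkpoints are large.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    return tokenizer, model


if __name__ == "__main__":
    for ref in map(parse_model_id, MODEL_IDS):
        print(ref.family, ref.variant, "nc" if ref.nc else "")
```

Keeping the loader separate from the parser means the listing can be inspected or filtered (for example, selecting only chat variants) without downloading any weights.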
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Computational and Text Analysis Methods
