SWEb: A Large Web Dataset for the Scandinavian Languages
Tobias Norlund, Tim Isbister, Amaru Cuba Gyllensten, Paul Dos Santos, Danila Petrelli, Ariel Ekgren, Magnus Sahlgren

TL;DR
This paper introduces SWEb, the largest Scandinavian language web dataset with over one trillion tokens, along with a new benchmark and a novel text extraction method, enabling improved language model training and evaluation.
Contribution
It provides the largest Scandinavian web dataset to date, a novel model-based text extractor, and a new cloze-style benchmark for Swedish language model evaluation.
Findings
Models trained on SWEb perform competitively with models trained on FineWeb on the new Swedish cloze-style benchmark.
The model-based text extractor significantly reduces pipeline complexity compared with rule-based approaches.
Open sharing of data, models, and code facilitates further research.
Abstract
This paper presents the hitherto largest pretraining dataset for the Scandinavian languages: the Scandinavian WEb (SWEb), comprising over one trillion tokens. The paper details the collection and processing pipeline, and introduces a novel model-based text extractor that significantly reduces complexity in comparison with rule-based approaches. We also introduce a new cloze-style benchmark for evaluating language models in Swedish, and use this test to compare models trained on the SWEb data to models trained on FineWeb, with competitive results. All data, models and code are shared openly.
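As a rough illustration of how a cloze-style comparison like the one mentioned in the abstract can be scored, here is a minimal sketch. It assumes each test item has a passage with one masked slot, a list of candidate fills, and a gold answer, and that a language model is exposed as a function returning a score (for example a log-likelihood) for a full text. The names `MASK`, `pick_fill`, `cloze_accuracy`, and `toy_score` are hypothetical and are not taken from the paper, and the actual benchmark format may differ.

```python
from typing import Callable, Sequence, Tuple

MASK = "[MASK]"  # hypothetical placeholder marking the slot to fill

# One item: (passage containing MASK, candidate fills, gold fill)
ClozeItem = Tuple[str, Sequence[str], str]


def pick_fill(passage: str, candidates: Sequence[str],
              score: Callable[[str], float]) -> str:
    """Return the candidate whose filled-in passage scores highest."""
    return max(candidates, key=lambda c: score(passage.replace(MASK, c)))


def cloze_accuracy(items: Sequence[ClozeItem],
                   score: Callable[[str], float]) -> float:
    """Fraction of items where the top-scoring candidate is the gold fill."""
    if not items:
        raise ValueError("need at least one cloze item")
    correct = sum(pick_fill(p, cands, score) == gold
                  for p, cands, gold in items)
    return correct / len(items)


def toy_score(text: str) -> float:
    """Stand-in for a language model's log-likelihood (assumption)."""
    known = {"Stockholm är Sveriges huvudstad.": -1.0}
    return known.get(text, -10.0)


items = [("[MASK] är Sveriges huvudstad.", ["Stockholm", "Oslo"], "Stockholm")]
accuracy = cloze_accuracy(items, toy_score)
```

To compare two models under this setup, one would call `cloze_accuracy` on the same items with each model's scoring function and compare the resulting accuracies.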
Peer Reviews
Decision: ICLR 2025 Poster
Originality: The SWEb dataset is original in its approach to handling Scandinavian languages. The authors create a model-based extraction process that moves away from rule-heavy, manual extraction methods, simplifying the pipeline. They also introduce HP-MEK, a benchmark specific to Swedish, which adds value by providing a relevant evaluation tool for Scandinavian models.
Quality: The quality of the work is evident in the detailed steps of the SWEb pipeline. The authors carefully build a proces
1. Limited Applicability Beyond Scandinavian Languages
Weakness: The SWEb pipeline is tailored specifically for Scandinavian languages, potentially limiting its scalability or adaptability to non-Scandinavian or low-resource languages. This narrow focus may reduce the general utility of SWEb’s approach in multilingual or global settings where language resources are scarcer.
Recommendation: It would be beneficial to discuss adapting the pipeline to other language families or the performance chal
The data set will be a valuable tool for researching language models and understanding their properties for Scandinavian languages, which are all under-resourced and under-investigated.
Some important points about the evaluation are not fully clear. The "HP-MEK" benchmark is unclear: what was the motivation for creating this test set, and what exactly was evaluated on it? The new text extraction method, or the language models trained on the extracted texts? For the language models (Section 4.2), the paper says there is a 90/10 training/test split. It is therefore not clear what was evaluated (the text extractor, the language models, or both) and how (on which test set or sets).
- Usefulness and relevance: Improving the quality and assessing that this quality has indeed improved for LLMs targeting languages other than English is a clearly relevant topic that will have uses even outside of academia.
- Open-source: Authors promise the release of the dataset, benchmark, and utils used to generate/evaluate them.
- Technical details and examples: The paper features many details about the implementation. Even if it was not open-source, I feel fairly confident that this work
While it's a sound contribution, I'm not sure ICLR is the right venue for this work. It lacks algorithmic or theoretical novelty, and it's rather a (very good) application of well-known NLP principles to process a new dataset for specific languages. For example, the heuristics mentioned in the paper are very similar to "old" related work, e.g. [1] https://arxiv.org/abs/1912.07076
Taxonomy
Topics: Natural Language Processing Techniques
