SWEb: A Large Web Dataset for the Scandinavian Languages

Tobias Norlund; Tim Isbister; Amaru Cuba Gyllensten; Paul Dos Santos; Danila Petrelli; Ariel Ekgren; Magnus Sahlgren

arXiv:2410.04456·cs.CL·October 8, 2024

TL;DR

This paper introduces SWEb, the largest Scandinavian language web dataset with over one trillion tokens, along with a new benchmark and a novel text extraction method, enabling improved language model training and evaluation.

Contribution

It provides the first large-scale Scandinavian web dataset, a novel model-based text extractor, and a new benchmark for Swedish language model evaluation.

Findings

01. Models trained on SWEb outperform those trained on FineWeb in Swedish tasks.

02. The model-based text extractor reduces processing complexity.

03. Open sharing of data, models, and code facilitates further research.

Abstract

This paper presents the hitherto largest pretraining dataset for the Scandinavian languages: the Scandinavian WEb (SWEb), comprising over one trillion tokens. The paper details the collection and processing pipeline, and introduces a novel model-based text extractor that significantly reduces complexity in comparison with rule-based approaches. We also introduce a new cloze-style benchmark for evaluating language models in Swedish, and use this test to compare models trained on the SWEb data to models trained on FineWeb, with competitive results. All data, models and code are shared openly.
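To make the idea of a cloze-style evaluation concrete, here is a minimal sketch. It assumes each test item is a passage with one blank, a list of candidate fillers, and the index of the correct filler; the model under test scores every filled-in passage, and the highest-scoring candidate counts as its answer. This page does not describe the benchmark's actual format or scoring, so the `MASK` token, the function names, and the toy unigram scorer (a stand-in for a real language model's log-likelihood) are all hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Sequence, Tuple

MASK = "____"  # hypothetical placeholder marking the blank in a passage


def make_unigram_scorer(corpus: str) -> Callable[[str], float]:
    """Toy stand-in for a language model: add-one smoothed unigram log-probability."""
    counts = Counter(corpus.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # reserve one slot of probability mass for unseen words

    def log_prob(text: str) -> float:
        return sum(
            math.log((counts[word] + 1) / (total + vocab))
            for word in text.lower().split()
        )

    return log_prob


def pick_answer(
    passage: str, candidates: Sequence[str], log_prob: Callable[[str], float]
) -> int:
    """Return the index of the candidate whose filled-in passage scores highest."""
    scores = [log_prob(passage.replace(MASK, cand)) for cand in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def accuracy(
    items: Sequence[Tuple[str, Sequence[str], int]],
    log_prob: Callable[[str], float],
) -> float:
    """Fraction of items where the top-scoring candidate is the gold one."""
    correct = sum(
        pick_answer(passage, cands, log_prob) == gold
        for passage, cands, gold in items
    )
    return correct / len(items)


# Assumed toy data, not taken from the benchmark.
scorer = make_unigram_scorer("katten sover på soffan och katten äter fisk")
items = [("____ sover på soffan", ["katten", "bilen"], 0)]
```

Calling `accuracy(items, scorer)` gives the share of blanks filled correctly. Comparing two pretraining corpora this way would mean swapping `scorer` for the log-likelihood of a model trained on each corpus and comparing the resulting accuracies on the same items.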

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 5

Strengths

Originality: The SWEb dataset is original in its approach to handling Scandinavian languages. The authors create a model-based extraction process that moves away from rule-heavy, manual extraction methods, simplifying the pipeline. They also introduce HP-MEK, a benchmark specific to Swedish, which adds value by providing a relevant evaluation tool for Scandinavian models.

Quality: The quality of the work is evident in the detailed steps of the SWEb pipeline. The authors carefully build a proces

Weaknesses

1. Limited Applicability Beyond Scandinavian Languages

Weakness: The SWEb pipeline is tailored specifically for Scandinavian languages, potentially limiting its scalability or adaptability to non-Scandinavian or low-resource languages. This narrow focus may reduce the general utility of SWEb’s approach in multilingual or global settings where language resources are scarcer.

Recommendation: It would be beneficial to discuss adapting the pipeline to other language families or the performance chal

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The data set will be a valuable tool for researching language models and understanding their properties for Scandinavian languages, which are all under-resourced and under-investigated.

Weaknesses

Some important points about the evaluation are not fully clear. "Benchmark HP-MEK" is unclear: what was the motivation for creating this test set, and what exactly was evaluated on it: the new text extraction method, or language models trained on the extracted texts? For language models (section 4.2), a 90/10 training/test split is mentioned. It is therefore not clear what was evaluated (the text extractor, the language models, or both) and how (on which test set or sets).

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- Usefulness and relevance: Improving quality for LLMs targeting languages other than English, and assessing that quality has indeed improved, is a clearly relevant topic that will have uses even outside of academia.
- Open-source: Authors promise the release of the dataset, benchmark, and utils used to generate/evaluate them.
- Technical details and examples: The paper features many details about the implementation. Even if it was not open-source, I feel fairly confident that this work

Weaknesses

While it's a sound contribution, I'm not sure ICLR is the right venue for this work. It lacks algorithmic or theoretical novelty, and it's rather a (very good) application of well-known NLP principles to process a new dataset for specific languages. For example, the heuristics mentioned in the paper are very similar to "old" related work, e.g. [1] https://arxiv.org/abs/1912.07076

Videos

SWEb: A Large Web Dataset for the Scandinavian Languages · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques