Finnish SQuAD: A Simple Approach to Machine Translation of Span Annotations
Emil Nuutinen, Iiro Rastas, Filip Ginter

TL;DR
This paper uses DeepL's ability to translate formatted documents to machine translate span-annotated datasets. The authors build a Finnish version of SQuAD2.0 with this method, train QA models on it, and report consistently better translated data than comparable datasets. The method is likely to carry over to other languages and other span-annotated datasets.
Contribution
The paper introduces a simple, effective translation approach for span-annotated datasets, validated on Finnish SQuAD2.0, with open access to code and data.
Findings
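Across direct evaluation, indirect comparison with similar datasets, and a backtranslation experiment, the method produces consistently better translated data.
The performance of QA models trained on the translated Finnish SQuAD2.0 data supports the same conclusion.
The authors consider it likely that the approach can be used for other span-annotated datasets, tasks, and languages.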
Abstract
We apply a simple method to machine translate datasets with span-level annotation using the DeepL MT service and its ability to translate formatted documents. Using this method, we produce a Finnish version of the SQuAD2.0 question answering dataset and train QA retriever models on this new dataset. We evaluate the quality of the dataset and more generally the MT method through direct evaluation, indirect comparison to other similar datasets, a backtranslation experiment, as well as through the performance of downstream trained QA models. In all these evaluations, we find that the method of transfer is not only simple to use but produces consistently better translated data. Given its good performance on the SQuAD dataset, it is likely the method can be used to translate other similar span-annotated datasets for other tasks and languages as well. All code and data is available under an…
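The span-transfer idea in the abstract can be illustrated with a minimal sketch: mark the answer span inside the context with formatting tags, translate the marked text, then read the span back out of the translation. This is not the authors' code. The `<b>` marker tags, the function names, and the `translate` callable (a stand-in for any MT service that preserves markup) are assumptions made for illustration.

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical marker tags; the abstract does not say which markup is used.
OPEN, CLOSE = "<b>", "</b>"


def mark_span(context: str, start: int, answer: str) -> str:
    """Wrap the answer span in marker tags before translation."""
    end = start + len(answer)
    if context[start:end] != answer:
        raise ValueError("answer_start does not match the answer text")
    return context[:start] + OPEN + answer + CLOSE + context[end:]


def extract_span(marked: str) -> Optional[Tuple[str, int, str]]:
    """Recover (clean_context, answer_start, answer) from marked text.

    Returns None if the markers did not survive translation.
    """
    pattern = re.escape(OPEN) + r"(.*?)" + re.escape(CLOSE)
    match = re.search(pattern, marked, flags=re.DOTALL)
    if match is None:
        return None
    answer = match.group(1)
    # Text before the opening tag is unchanged, so its length is the new offset.
    clean = marked[:match.start()] + answer + marked[match.end():]
    return clean, match.start(), answer


def translate_example(
    context: str,
    start: int,
    answer: str,
    translate: Callable[[str], str],  # assumed: an MT call that keeps the tags
) -> Optional[Tuple[str, int, str]]:
    """Translate one context and transfer its answer span annotation."""
    return extract_span(translate(mark_span(context, start, answer)))
```

Passing the translator in as a callable keeps the sketch independent of any particular MT client. Examples whose markers are lost in translation come back as `None`, so they can be dropped or handled separately.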
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies
