# TCBLex - A lexical database of Finnish literary texts for children

**Authors:** Tapio Nojonen, Kiia Korsu, Filip Ginter, Veronika Laippala, Jenna Kanerva

PMC · DOI: 10.3758/s13428-025-02832-x · Behavior Research Methods · 2025-10-15

## TL;DR

TCBLex is a Finnish lexical database of children's literary texts, offering annotated linguistic data and statistics for research and educational purposes.

## Contribution

TCBLex introduces a freely available, annotated Finnish lexical database with psycholinguistic statistics and age-of-first-encounter data for children's literature.

## Key findings

- TCBLex contains over 11 million tokens, each annotated with a part-of-speech tag and a lemma.
- The database includes 14 sub-lexicons covering individual intended reading ages, age groups, and genres.
- It provides novel age-of-first-encounter data for Finnish words and lemmas in children's literature.
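As a rough illustration of the kind of per-lemma statistics listed above, the sketch below computes a raw count, a frequency per million tokens, and an age of first encounter from a toy token list. The data, the function name `lemma_stats`, and the choice to define age of first encounter as the lowest intended reading age at which a lemma appears are all assumptions made for this example; they do not reflect the actual TCBLex file format or the authors' pipeline.

```python
from collections import Counter

# Hypothetical toy data: (lemma, intended reading age of the book it occurs in).
# This is NOT the TCBLex format, only an illustration.
tokens = [
    ("kissa", 7), ("koira", 7), ("kissa", 9),
    ("talo", 9), ("koira", 12), ("kissa", 12),
]


def lemma_stats(tokens):
    """Return {lemma: (count, frequency per million tokens, age of first encounter)}.

    Assumption: age of first encounter is the lowest intended reading age
    among the books in which the lemma occurs.
    """
    counts = Counter(lemma for lemma, _ in tokens)

    # Track the minimum intended reading age seen for each lemma.
    first_age = {}
    for lemma, age in tokens:
        first_age[lemma] = min(age, first_age.get(lemma, age))

    total = len(tokens)
    return {
        lemma: (n, n / total * 1_000_000, first_age[lemma])
        for lemma, n in counts.items()
    }


stats = lemma_stats(tokens)
```

In this toy input, `kissa` occurs in three of six tokens and first appears in a book intended for 7-year-olds, so `stats["kissa"]` holds a count of 3 and a first-encounter age of 7.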

## Abstract

This work introduces TCBLex, a lexical database of Finnish literary works read by children between the ages of 7 and 15. We explain in detail the work done to build the corpus TCBLex is based on, including how books were sampled and collected, turned into text files, and finally processed. We also touch on legal considerations and how it is possible to build such a corpus in the EU. TCBLex contains over 11 million tokens that are annotated with part-of-speech tags and lemmatized. We provide 14 different sub-lexicons in total, covering individual intended reading ages, age groups, as well as different genres. We also provide versions with additional morphological information, such as the cases and tenses of words. TCBLex provides various psycholinguistically interesting lexical statistics for both word types and lemmas, such as different frequency metrics, distributions, word lengths, numbers of syllables, morphological paradigm sizes, and, for the first time in a Finnish lexicon, ages when words and lemmas are first encountered in books. TCBLex is freely available at https://doi.org/10.5281/zenodo.15655580.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12528317/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12528317/full.md

## References

26 references — full list in the complete paper: https://tomesphere.com/paper/PMC12528317/full.md

---
Source: https://tomesphere.com/paper/PMC12528317