# The HeCz corpus: A large, richly annotated reading corpus of newspaper headlines in Czech

**Authors:** Jan Chromý, Markéta Ceháková, James Brand

PMC · DOI: 10.3758/s13428-025-02863-4 · Behavior Research Methods · 2025-11-14

## TL;DR

The HeCz corpus is a large Czech self-paced reading dataset of newspaper headlines, with word-level reading times, comprehension data, and linguistic annotations, collected from 1872 native Czech speakers.

## Contribution

The HeCz corpus introduces a large-scale, richly annotated dataset of Czech newspaper headlines with reading times and comprehension data.

## Key findings

- The corpus provides word-level self-paced reading times for 1919 headlines (23,634 words), plus comprehension accuracy from one yes–no question per headline.
- Exploratory analyses predicting word reading times indicate important roles for linguistic, demographic, and methodological variables.
- The dataset includes re-testing data from 1162 participants, who repeated the experiment with the same stimuli approximately 1 month later, enabling test-retest comparisons of reading behavior.
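
One way the re-testing data might be used is to estimate test-retest reliability, for example by correlating each participant's mean reading time across the two sessions. The sketch below is a minimal illustration under stated assumptions: the participant IDs and millisecond values are made-up toy numbers, not corpus data, and the function names `pearson_r` and `retest_reliability` are hypothetical. This summary does not describe the corpus file layout, so loading the real data is left out.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def retest_reliability(first, second):
    """Correlate per-participant means, using only participants in both sessions."""
    shared = sorted(set(first) & set(second))
    return pearson_r([first[p] for p in shared], [second[p] for p in shared])


# Toy values (assumed, in milliseconds): mean reading time per participant.
session1 = {"p01": 412.0, "p02": 389.5, "p03": 455.2, "p04": 501.3}
session2 = {"p01": 398.1, "p02": 380.0, "p03": 440.7}  # p04 did not return

if __name__ == "__main__":
    print(f"test-retest r = {retest_reliability(session1, session2):.3f}")
```

Restricting the correlation to participants present in both sessions mirrors the situation described above, where only a subset of the original sample completed the second session.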

## Abstract

Large behavioral datasets that provide detailed data on reading processes are valuable resources for a range of researchers working in linguistics, psychology and cognitive science. This paper presents the HeCz corpus, which comprises self-paced reading data for 1919 newspaper headlines (23,634 words) in Czech, with each headline being accompanied by a yes–no comprehension question, resulting in a rich dataset of reading times for each individual word and comprehension accuracy. The corpus is novel in terms of the sheer scale of data collection, with 1872 native Czech speakers, each reading approximately 120 headlines, with 1162 of those participants also completing the experiment again in a re-testing session using the same stimuli approximately 1 month later. There is participant level meta-data also available relating to basic demographic information, reading habits and a profile of their mood state prior to completing the experiment. Beyond the behavioral and demographic data, we also include a range of linguistic annotations for several variables, e.g., frequency, surprisal, morphological tagging. To better understand how these variables might impact processing, we present exploratory analyses where we predicted the reading times for words, with the results indicating important roles for linguistic, demographic, and methodological variables. Given the range of multidisciplinary applications of the HeCz corpus, we hope that it will provide a valuable and unprecedented resource for a range of research applications related to reading processes.
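
Of the annotations mentioned in the abstract, surprisal has a compact standard definition: the negative log probability of a word given its context. The sketch below shows only that conversion. It assumes the conditional probability is already available from some language model, which this summary does not identify, and it uses base-2 logarithms (bits) as a common convention, not necessarily the unit used in the corpus. The function name `surprisal_bits` is hypothetical.

```python
from math import log2


def surprisal_bits(probability):
    """Return surprisal in bits, -log2(p), for a word's conditional probability p."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in the interval (0, 1]")
    return -log2(probability)


if __name__ == "__main__":
    # A word predicted with probability 0.25 carries 2 bits of surprisal.
    print(surprisal_bits(0.25))
```

Less predictable words receive higher surprisal, which is why the measure is commonly used as a predictor of word-level reading times.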

## Full-text entities

- **Diseases:** depression (MESH:D003866), COVID-19 (MESH:D000086382), MECO (MESH:D015835), dyslexia (MESH:D004410), Mood (MESH:D019964), confusion (MESH:D003221), fatigue (MESH:D005221), tension (MESH:D018781)
- **Chemicals:** alcohol (MESH:D000438), Staropramenu (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12618343/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12618343/full.md

## References

45 references — full list in the complete paper: https://tomesphere.com/paper/PMC12618343/full.md

---
Source: https://tomesphere.com/paper/PMC12618343