# Constructing the Corpus of Children’s Video Media (CCVM): A new resource and guidelines for constructing comparable and reusable corpora

**Authors:** Anna Gowenlock, Jennifer Rodd, Beth Malory, Courtenay Norbury

PMC · DOI: 10.3758/s13428-025-02925-7 · Behavior Research Methods · 2026-01-23

## TL;DR

This paper introduces a new corpus of children's video media to study language exposure for 3–5-year-olds in the UK.

## Contribution

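The paper presents the Corpus of Children’s Video Media (CCVM), a specialised corpus of the spoken language in television and online videos popular among 3–5-year-old children in the UK, designed to be comparable to an existing corpus of child-directed speech (CDS).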

## Key findings

- The CCVM contains 233,471 tokens from 161 transcripts representing 43.12 hours of children's video content.
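- The corpus is available on the Open Science Framework (OSF) as a scrambled database of tokens with gloss, stem, and lemma forms and part-of-speech tags, organised within transcripts, together with metadata for each transcript.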
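The headline figures above imply some rough averages that help when comparing the CCVM with other corpora. The following is a minimal sketch of that arithmetic; the variable names are illustrative, and the derived rates are simple averages computed here, not values reported in the paper.

```python
# Headline figures as stated in the summary above.
TOKENS = 233_471        # total tokens in the corpus
TRANSCRIPTS = 161       # number of transcripts
HOURS = 43.12           # total video duration in hours

# Derived averages (assumption: simple means over the whole corpus;
# individual transcripts will vary in length and speech rate).
tokens_per_transcript = TOKENS / TRANSCRIPTS
tokens_per_hour = TOKENS / HOURS
tokens_per_minute = tokens_per_hour / 60
minutes_per_transcript = HOURS * 60 / TRANSCRIPTS

print(f"tokens per transcript:  {tokens_per_transcript:.1f}")
print(f"tokens per hour:        {tokens_per_hour:.1f}")
print(f"tokens per minute:      {tokens_per_minute:.1f}")
print(f"minutes per transcript: {minutes_per_transcript:.1f}")
```

These averages describe the corpus as a whole and say nothing about how token counts are distributed across individual programmes.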

## Abstract

A growing number of psycholinguistic studies use methods from corpus linguistics to examine the language that children encounter in their environment, to understand how they might acquire different aspects of linguistic knowledge. Many of these studies focus on child-directed speech or children’s literature, while there is a paucity of work focusing on children’s television and video media. We describe the creation and contents of the Corpus of Children’s Video Media (CCVM), a specialised corpus designed to represent the spoken language in television and online videos popular among 3–5-year-old children in the UK (available as a scrambled database of tokens). The CCVM was designed to be comparable to an existing corpus of child-directed speech (CDS). We used a dual sampling approach: inclusion decisions were guided by (a) a survey of parents with children in our target age group, and (b) a survey of programmes available on popular streaming platforms. The corpus consists of 233,471 tokens across 161 transcripts (43.12 h of video) and is available on the Open Science Framework (OSF) as a scrambled database of tokens (including gloss, stem, and lemma forms, and part-of-speech tags), organised within transcripts, together with relevant metadata for each transcript. We discuss the challenges of creating a corpus that is comparable to existing datasets and highlight the importance of transparency in this process. We take an open science approach, sharing a detailed data collection and processing protocol, code, and data so that the corpus can be evaluated, extended, and used appropriately by other research teams.
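To make the idea of a "scrambled database of tokens, organised within transcripts" concrete, here is a minimal sketch of one way such scrambling could work: token order is shuffled inside each transcript, so word order is lost while per-transcript token counts are preserved. This is an illustration under stated assumptions, not the authors' procedure; the field names (`transcript_id`, `gloss`, `lemma`, `pos`), the function name, and the toy rows are all hypothetical.

```python
import random
from collections import Counter


def scramble_within_transcripts(rows, seed=0):
    """Shuffle token order inside each transcript (hypothetical scheme).

    `rows` is a list of dicts, one per token. Tokens never move between
    transcripts, so per-transcript frequencies are unchanged.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_transcript = {}
    for row in rows:
        by_transcript.setdefault(row["transcript_id"], []).append(row)

    scrambled = []
    for tokens in by_transcript.values():
        shuffled = tokens[:]   # copy, so the input list is not mutated
        rng.shuffle(shuffled)
        scrambled.extend(shuffled)
    return scrambled


# Toy data with made-up field names and values, for illustration only.
rows = [
    {"transcript_id": "t1", "gloss": "the", "lemma": "the", "pos": "det"},
    {"transcript_id": "t1", "gloss": "dogs", "lemma": "dog", "pos": "n"},
    {"transcript_id": "t1", "gloss": "jumped", "lemma": "jump", "pos": "v"},
    {"transcript_id": "t2", "gloss": "cats", "lemma": "cat", "pos": "n"},
    {"transcript_id": "t2", "gloss": "sleep", "lemma": "sleep", "pos": "v"},
]

scrambled = scramble_within_transcripts(rows)


def lemma_counts(data):
    """Count (transcript, lemma) pairs; invariant under within-transcript shuffling."""
    return Counter((r["transcript_id"], r["lemma"]) for r in data)


print(lemma_counts(rows) == lemma_counts(scrambled))
```

A design like this would keep frequency-based analyses (lemma counts, part-of-speech distributions per transcript) intact while preventing reconstruction of the original running text.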


## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12830416/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12830416/full.md

## References

9 references — full list in the complete paper: https://tomesphere.com/paper/PMC12830416/full.md

---
Source: https://tomesphere.com/paper/PMC12830416