PROCESS-2: A Benchmark Speech Corpus for Early Cognitive Impairment Detection

Madhurananda Pahar; Caitlin H. Illingworth; Bahman Mirheidari; Hend Elghazaly; Fritz Peters; Sophie Young; Wing-Zin Leung; Labhpreet Kaur; Daniel Blackburn; Heidi Christensen

arXiv:2605.14888·cs.SD·May 15, 2026

TL;DR

PROCESS-2 is a large, validated speech dataset designed to advance automatic detection of cognitive impairment through speech analysis, supporting scalable and non-invasive clinical research.

Contribution

The paper introduces PROCESS-2, a comprehensive, clinically validated speech corpus with standardized tasks and metadata, enabling reproducible research in cognitive impairment detection.

Findings

01 Demonstrated clinically meaningful group separation

02 Achieved stable baseline modelling performance

03 Validated dataset quality and demographic balance

Abstract

Speech-based analysis offers a scalable and non-invasive approach for detecting cognitive decline, yet progress has been constrained by the limited availability of clinically validated datasets collected under realistic conditions. We introduce PROCESS-2, a large-scale speech dataset designed to support research on automatic assessment of cognitive impairment from spontaneous and task-oriented speech. The dataset comprises recordings from 200 healthy controls, 150 participants with mild cognitive impairment, and 50 participants with a dementia diagnosis, collected using the CognoMemory digital assessment platform. Each participant completed a single assessment session that included picture description and verbal fluency tasks, and each session is accompanied by manually verified transcripts and participant-level metadata. PROCESS-2 contains approximately 21 hours of speech audio with predefined train/test partitions. Comprehensive technical…
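
As a rough illustration of the composition described in the abstract, the following Python sketch computes the share of each diagnostic group and the average amount of speech per participant from the stated figures (200, 150, and 50 participants; approximately 21 hours of audio). The inverse-frequency class weights are an assumption added here for illustration, a common way to compensate for class imbalance, and are not stated in the source as the authors' method. All variable names are hypothetical.

```python
# Group sizes as stated in the abstract.
group_sizes = {
    "healthy_control": 200,
    "mild_cognitive_impairment": 150,
    "dementia": 50,
}

total_participants = sum(group_sizes.values())

# Fraction of the cohort in each diagnostic group.
proportions = {
    group: count / total_participants for group, count in group_sizes.items()
}

# Assumption (not from the paper): inverse-frequency weights,
# n_samples / (n_classes * class_count), often used to offset imbalance.
class_weights = {
    group: total_participants / (len(group_sizes) * count)
    for group, count in group_sizes.items()
}

# The abstract gives the total duration only approximately ("approximately 21 hours").
approx_total_hours = 21
mean_minutes_per_participant = approx_total_hours * 60 / total_participants

for group in group_sizes:
    print(f"{group}: share={proportions[group]:.3f}, weight={class_weights[group]:.3f}")
print(f"mean speech per participant: {mean_minutes_per_participant:.2f} min")
```

Under these figures, the dementia group makes up 12.5% of the 400 participants, and each participant contributes a little over three minutes of speech on average. The per-participant figure is only as precise as the approximate 21-hour total.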

Code & Models

Datasets

CognoSpeak/PROCESS-2 (dataset, 201 downloads)
