VoxKnesset: A Large-Scale Longitudinal Hebrew Speech Dataset for Aging Speaker Modeling

Yanir Marmor, Arad Zulti, David Krongauz, Adam Gabet, Yoad Snapir, Yair Lifshitz, and Eran Segal

arXiv:2603.01270 · eess.AS · March 6, 2026

TL;DR

VoxKnesset is an open-access dataset of ~2,300 hours of Hebrew parliamentary speech (2009-2025, 393 speakers, recording spans of up to 15 years) that supports research on how vocal aging affects speech processing and benchmarks modern speech embeddings on age prediction and speaker verification.

Contribution

The paper introduces VoxKnesset, a large-scale, longitudinal Hebrew speech dataset with aligned transcripts and metadata, supporting aging-related speech research and model evaluation.

Findings

01. Speaker verification degrades as the voice ages: EER rises from 2.15% to 4.58% over 15 years for the strongest model.

02. Age regressors trained cross-sectionally fail to capture within-speaker aging.

03. Longitudinally trained models recover a meaningful temporal signal.

Abstract

Speech processing systems face a fundamental challenge: the human voice changes with age, yet few datasets support rigorous longitudinal evaluation. We introduce VoxKnesset, an open-access dataset of ~2,300 hours of Hebrew parliamentary speech spanning 2009-2025, comprising 393 speakers with recording spans of up to 15 years. Each segment includes aligned transcripts and verified demographic metadata from official parliamentary records. We benchmark modern speech embeddings (WavLM-Large, ECAPA-TDNN, Wav2Vec2-XLSR-1B) on age prediction and speaker verification under longitudinal conditions. Speaker verification EER rises from 2.15% to 4.58% over 15 years for the strongest model, and cross-sectionally trained age regressors fail to capture within-speaker aging, while longitudinally trained models recover a meaningful temporal signal. We publicly release the dataset and pipeline to…
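For readers unfamiliar with the equal error rate (EER) metric quoted in the abstract, the following is a minimal sketch of how it is commonly computed from verification trial scores. It is not the authors' evaluation pipeline: the function name `equal_error_rate` and the toy scores are hypothetical, and the sketch assumes that higher scores mean "more likely the same speaker". The EER is the operating point where the false-accept rate (impostor trials accepted) equals the false-reject rate (genuine trials rejected).

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER from two lists of similarity scores.

    genuine:  scores for same-speaker trials (hypothetical input)
    impostor: scores for different-speaker trials (hypothetical input)
    Assumes a trial is accepted when its score is >= the threshold.
    """
    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(genuine) | set(impostor))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False-accept rate: impostor trials at or above the threshold.
        far = sum(s >= t for s in impostor) / len(impostor)
        # False-reject rate: genuine trials below the threshold.
        frr = sum(s < t for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint of FAR and FRR where they are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (illustrative only, not taken from the paper).
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.1, 0.2, 0.3, 0.6]
print(f"EER = {equal_error_rate(genuine_scores, impostor_scores):.2%}")
```

In a longitudinal setting such as the one described here, the same computation would presumably be repeated on trial sets grouped by the time gap between enrollment and test recordings, which is how a rise in EER over the years becomes visible.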

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Authorship Attribution and Profiling