Layer-Wise Analysis of Self-Supervised Representations for Age and Gender Classification in Children's Speech

Abhijit Sinha; Harishankar Kumar; Mohit Joshi; Hemant Kumar Kathania; Shrikanth Narayanan; Sudarsana Reddy Kadiri

arXiv:2508.10332 · eess.AS · August 15, 2025

TL;DR

This study investigates how self-supervised speech models encode age and gender traits in children's speech. It finds that early layers are the most speaker-specific and that applying PCA improves classification accuracy.

Contribution

The paper provides a detailed layer-wise analysis of four Wav2Vec2 variants on children's speech, showing how speaker traits are represented across layers and that PCA-reduced representations improve classification.

Findings

01

Early layers (1-7) capture speaker-specific cues more effectively than deeper layers, which increasingly focus on linguistic information.

02

Applying PCA enhances classification accuracy and reduces redundancy.

03

Wav2Vec2-large-lv60 achieves 97.14% accuracy on age classification and 98.20% on gender classification on CMU Kids.

Abstract

Children's speech presents challenges for age and gender classification due to high variability in pitch, articulation, and developmental traits. While self-supervised learning (SSL) models perform well on adult speech tasks, their ability to encode speaker traits in children remains underexplored. This paper presents a detailed layer-wise analysis of four Wav2Vec2 variants using the PFSTAR and CMU Kids datasets. Results show that early layers (1-7) capture speaker-specific cues more effectively than deeper layers, which increasingly focus on linguistic information. Applying PCA further improves classification, reducing redundancy and highlighting the most informative components. The Wav2Vec2-large-lv60 model achieves 97.14% (age) and 98.20% (gender) on CMU Kids; base-100h and large-lv60 models reach 86.05% and 95.00% on PFSTAR. These results reveal how speaker traits are structured…
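
The layer-wise probing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the classifier (nearest centroid), the train/test split, the number of PCA components, and all function names are assumptions chosen for illustration, and the features are synthetic stand-ins for utterance-level embeddings (in practice these might be mean-pooled per-layer hidden states from a Wav2Vec2 model, which is also an assumption here). The toy data is constructed so that the class signal shrinks with depth, purely to make the output easy to read.

```python
import numpy as np


def pca_fit(X, n_components):
    """Fit PCA via SVD on X of shape (n_samples, dim); return (mean, components)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_transform(X, mean, components):
    """Project X onto the retained principal components."""
    return (X - mean) @ components.T


def nearest_centroid_accuracy(X_train, y_train, X_test, y_test):
    """Stand-in classifier (an assumption): assign each test point to the closest class mean."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    preds = classes[dists.argmin(axis=1)]
    return float((preds == y_test).mean())


def layerwise_probe(features, labels, n_components, train_frac=0.8, seed=0):
    """Score each layer separately.

    features: array of shape (n_layers, n_utterances, dim)
    labels:   array of shape (n_utterances,)
    Returns one held-out accuracy per layer.
    """
    rng = np.random.default_rng(seed)
    n_utts = features.shape[1]
    order = rng.permutation(n_utts)
    n_train = int(train_frac * n_utts)
    train_idx, test_idx = order[:n_train], order[n_train:]

    scores = []
    for layer_feats in features:
        # PCA is fitted on the training split only, then applied to both splits.
        mean, comps = pca_fit(layer_feats[train_idx], n_components)
        z_train = pca_transform(layer_feats[train_idx], mean, comps)
        z_test = pca_transform(layer_feats[test_idx], mean, comps)
        scores.append(
            nearest_centroid_accuracy(
                z_train, labels[train_idx], z_test, labels[test_idx]
            )
        )
    return scores


# Synthetic demo: the class signal is strongest in layer 0 and absent in the last layer.
rng = np.random.default_rng(0)
n_layers, n_utts, dim = 4, 200, 32
labels = rng.integers(0, 2, size=n_utts)
signal = np.linspace(3.0, 0.0, n_layers)  # hypothetical per-layer signal strength
direction = rng.normal(size=dim)
features = rng.normal(size=(n_layers, n_utts, dim))
features += signal[:, None, None] * labels[None, :, None] * direction[None, None, :]

scores = layerwise_probe(features, labels, n_components=8)
for layer, acc in enumerate(scores):
    print(f"layer {layer}: accuracy {acc:.2f}")
```

Fitting PCA per layer on the training split alone keeps the held-out accuracy from being influenced by test data, which matters when comparing layers against each other.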
