Audio-visual child-adult speaker classification in dyadic interactions

Anfeng Xu; Kevin Huang; Tiantian Feng; Helen Tager-Flusberg; Shrikanth Narayanan

arXiv:2310.01867 · eess.AS · June 13, 2025

TL;DR

This paper enhances child-adult speaker classification in dyadic interactions by integrating visual cues with audio, resulting in improved accuracy and robustness in identifying speakers.

Contribution

It introduces a multimodal framework combining audio and visual signals for speaker classification, surpassing traditional audio-only methods.

Findings

01 Visual cues improve classification accuracy.

02 Having both faces visible yields higher F1 scores.

03 The approach is robust across diverse conditions.

Abstract

Interactions involving children span a wide range of important domains from learning to clinical diagnostic and therapeutic contexts. Automated analyses of such interactions are motivated by the need to seek accurate insights and offer scale and robustness across diverse and wide-ranging conditions. Identifying the speech segments belonging to the child is a critical step in such modeling. Conventional child-adult speaker classification typically relies on audio modeling approaches, overlooking visual signals that convey speech articulation information, such as lip motion. Building on the foundation of an audio-only child-adult speaker classification pipeline, we propose incorporating visual cues through active speaker detection and visual processing models. Our framework involves video pre-processing, utterance-level child-adult speaker detection, and late fusion of modality-specific…
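
The late-fusion step named in the abstract can be illustrated with a minimal sketch. The abstract is truncated before it says what is fused or how, so everything here is an assumption made for illustration: that each modality produces child/adult probabilities per utterance, that fusion is a weighted average, and that the audio prediction is used alone when no visual prediction is available. The names `late_fusion`, `predict_speaker`, and `audio_weight` are hypothetical and do not come from the paper.

```python
from typing import Dict, Optional

LABELS = ("child", "adult")


def late_fusion(
    audio_probs: Dict[str, float],
    visual_probs: Optional[Dict[str, float]],
    audio_weight: float = 0.5,
) -> Dict[str, float]:
    """Combine per-modality class probabilities for a single utterance.

    Assumption (not from the paper): fusion is a weighted average of the
    audio and visual probabilities. If visual_probs is None (for example,
    no usable face track), the audio probabilities are used on their own.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")

    if visual_probs is None:
        fused = {label: audio_probs[label] for label in LABELS}
    else:
        fused = {
            label: audio_weight * audio_probs[label]
            + (1.0 - audio_weight) * visual_probs[label]
            for label in LABELS
        }

    # Renormalize so the fused scores sum to 1.
    total = sum(fused.values())
    if total <= 0.0:
        raise ValueError("probabilities must sum to a positive value")
    return {label: score / total for label, score in fused.items()}


def predict_speaker(fused_probs: Dict[str, float]) -> str:
    """Return the label with the highest fused probability."""
    return max(LABELS, key=lambda label: fused_probs[label])


if __name__ == "__main__":
    audio = {"child": 0.4, "adult": 0.6}   # made-up scores for illustration
    visual = {"child": 0.9, "adult": 0.1}  # made-up scores for illustration
    print(predict_speaker(late_fusion(audio, visual)))
    print(predict_speaker(late_fusion(audio, None)))
```

In this sketch the two modalities stay independent until the final combination, which is what makes the fusion "late": either model can be retrained or dropped without touching the other. The weight would need to be tuned on held-out data.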

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing