# Speaker-turn aware diarization for speech-based cognitive assessments

**Authors:** Sean Shensheng Xu, Xiaoquan Ke, Man-Wai Mak, Ka Ho Wong, Helen Meng, Timothy C. Y. Kwok, Jason Gu, Jian Zhang, Wei Tao, Chunqi Chang

PMC · DOI: 10.3389/fnins.2023.1351848 · Frontiers in Neuroscience · 2024-01-16

## TL;DR

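This paper improves speaker diarization for speech-based Montreal Cognitive Assessment (MoCA) recordings, attributing speech segments to speakers more accurately than conventional x-vector/PLDA systems under language, age, and microphone mismatch.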

## Contribution

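The paper introduces three enhancements to conventional speaker diarization for MoCA recordings: a multi-scale channel-interdependence speaker embedding, a speaker-turn aware scoring matrix built by sequence comparison over the whole conversation, and a pairwise similarity measure that lets the scoring matrix carry both local and global information.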

## Key findings

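- On an interactive MoCA dataset, the proposed enhancements outperform conventional x-vector/PLDA systems under language-, age-, and microphone-mismatch scenarios.
- The enhancements help hypothesize speaker-turn timestamps, so the method can be applied to datasets that lack timestamp information.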

## Abstract

Speaker diarization is an essential preprocessing step for diagnosing cognitive impairments from speech-based Montreal Cognitive Assessments (MoCA).

This paper proposes three enhancements to the conventional speaker diarization methods for such assessments. The enhancements tackle the challenges of diarizing MoCA recordings on two fronts. First, multi-scale channel interdependence speaker embedding is used as the front-end speaker representation for overcoming the acoustic mismatch caused by far-field microphones. Specifically, a squeeze-and-excitation (SE) unit and channel-dependent attention are added to Res2Net blocks for multi-scale feature aggregation. Second, a sequence comparison approach with a holistic view of the whole conversation is applied to measure the similarity of short speech segments in the conversation, which results in a speaker-turn aware scoring matrix for the subsequent clustering step. Third, to further enhance the diarization performance, we propose incorporating a pairwise similarity measure so that the speaker-turn aware scoring matrix contains both local and global information across the segments.
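The squeeze-and-excitation (SE) unit mentioned in the first enhancement can be illustrated with a minimal sketch. This is the generic SE operation (average each channel over time, pass the result through a small bottleneck, and rescale the channels with sigmoid gates), not the authors' implementation. The feature-map size, the bottleneck width, the random weights, and the omission of bias terms are all assumptions made for illustration, and the function name `squeeze_excite` is hypothetical.

```python
import math
import random


def squeeze_excite(features, w1, w2):
    """Rescale each channel of a (channels x frames) feature map with an SE gate.

    Bias terms are omitted for brevity (an assumption of this sketch).
    """
    # Squeeze: average each channel over time to get one descriptor per channel.
    z = [sum(row) / len(row) for row in features]
    # Excitation, step 1: bottleneck layer followed by ReLU.
    hidden = [max(0.0, sum(w * v for w, v in zip(w_row, z))) for w_row in w1]
    # Excitation, step 2: expand back to one sigmoid gate per channel.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(w_row, hidden))))
        for w_row in w2
    ]
    # Scale: multiply every frame of a channel by that channel's gate.
    scaled = [[g * x for x in row] for g, row in zip(gates, features)]
    return scaled, gates


# Hypothetical sizes and random weights, chosen only to make the sketch runnable.
random.seed(0)
channels, frames, bottleneck = 8, 50, 2
features = [[random.gauss(0.0, 1.0) for _ in range(frames)] for _ in range(channels)]
w1 = [[random.gauss(0.0, 0.5) for _ in range(channels)] for _ in range(bottleneck)]
w2 = [[random.gauss(0.0, 0.5) for _ in range(bottleneck)] for _ in range(channels)]

scaled, gates = squeeze_excite(features, w1, w2)
```

Because each gate lies between 0 and 1, the unit can only attenuate channels relative to one another; in a trained network the weights would be learned so that informative channels keep gates close to 1.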

Evaluations on an interactive MoCA dataset show that the proposed enhancements lead to a diarization system that outperforms the conventional x-vector/PLDA systems under language-, age-, and microphone-mismatch scenarios.

The results also show that the proposed enhancements can help hypothesize the speaker-turn timestamps, making the diarization method amenable to datasets without timestamp information.
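One simple way to picture hypothesized speaker-turn timestamps is to merge consecutive segments that were assigned to the same speaker. The sketch below does only that; it is not the paper's method, which this summary does not describe. It assumes non-overlapping, fixed-length segments, and the function name `hypothesize_turns`, the 1.5-second segment duration, and the speaker labels are all hypothetical.

```python
def hypothesize_turns(labels, segment_duration=1.5):
    """Merge consecutive same-speaker segments into (start, end, speaker) turns.

    Assumes non-overlapping segments of equal length, in seconds.
    """
    turns = []
    for index, speaker in enumerate(labels):
        start = index * segment_duration
        end = start + segment_duration
        if turns and turns[-1][2] == speaker:
            # Same speaker as the previous segment: extend the current turn.
            turns[-1] = (turns[-1][0], end, speaker)
        else:
            # Speaker changed: a new turn starts at this segment boundary.
            turns.append((start, end, speaker))
    return turns


# Hypothetical per-segment cluster labels from a two-speaker conversation.
labels = ["assessor", "assessor", "participant", "participant", "participant", "assessor"]
turns = hypothesize_turns(labels, segment_duration=1.5)
# turns == [(0.0, 3.0, "assessor"), (3.0, 7.5, "participant"), (7.5, 9.0, "assessor")]
```

The turn boundaries here can only fall on segment edges, so shorter segments give finer timestamps at the cost of less speech per segment for the speaker embedding.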

## Full-text entities

- **Diseases:** Osteoporosis (MESH:D010024), MCI (MESH:D060825), AD (MESH:D000544), MoCA (MESH:D003072), NCDs (MESH:D019965), Dementia (MESH:D003704), bipolar disorder (MESH:D001714)
- **Species:** Homo sapiens (human, species) [taxon 9606]
- **Cell lines:** 213 — Homo sapiens (Human), Finite cell line (CVCL_V755)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC10824834/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC10824834/full.md

## References

53 references — full list in the complete paper: https://tomesphere.com/paper/PMC10824834/full.md

---
Source: https://tomesphere.com/paper/PMC10824834