# Multi-Channel Spectro-Temporal Representations for Speech-Based Parkinson’s Disease Detection

**Authors:** Hadi Sedigh Malekroodi, Nuwan Madusanka, Byeong-il Lee, Myunggi Yi

PMC · DOI: 10.3390/jimaging11100341 · Journal of Imaging · 2025-10-01

## TL;DR

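This paper introduces a deep-learning method that fuses three time–frequency representations of sentence-level speech into a multi-channel input to detect Parkinson’s Disease, reaching 84.39% accuracy on the PC-GITA dataset with EfficientNet-B2.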

## Contribution

The paper proposes a multi-channel spectro-temporal fusion approach for PD detection and evaluates it across CNNs (ResNet, DenseNet, and EfficientNet) and a Vision Transformer.

## Key findings

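- Fusing three time–frequency representations (mel spectrogram, CQT, and gammatone spectrogram) improves PD detection over single representations across architectures.
- EfficientNet-B2 achieved the highest accuracy (84.39% ± 5.19%) and F1-score (84.35% ± 5.52%), outperforming recent methods that use handcrafted features or pretrained models on the same task and dataset.
- Emotionally salient and prosodically emphasized utterances yield higher AUC, suggesting that richer prosody enhances discriminability.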
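
The fusion idea in the first finding can be illustrated with a minimal sketch. It assumes the three time–frequency maps have already been computed and only shows how they might be brought to a common size and stacked as image channels. The 224×224 target size, the nearest-neighbour resizing, the per-channel min-max scaling, and the helper names (`resize_nearest`, `min_max_scale`, `fuse_channels`) are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np


def resize_nearest(spec, out_shape):
    """Resize a 2-D array to out_shape by nearest-neighbour index sampling."""
    rows = np.linspace(0, spec.shape[0] - 1, out_shape[0]).round().astype(int)
    cols = np.linspace(0, spec.shape[1] - 1, out_shape[1]).round().astype(int)
    return spec[np.ix_(rows, cols)]


def min_max_scale(spec):
    """Scale a 2-D array to [0, 1]; a constant array maps to all zeros."""
    lo, hi = spec.min(), spec.max()
    if hi == lo:
        return np.zeros_like(spec, dtype=np.float32)
    return ((spec - lo) / (hi - lo)).astype(np.float32)


def fuse_channels(mel, cqt, gammatone, out_shape=(224, 224)):
    """Stack three time-frequency maps into one (H, W, 3) array.

    out_shape is an assumed image size, not a value taken from the paper.
    """
    channels = []
    for spec in (mel, cqt, gammatone):
        spec = np.asarray(spec, dtype=np.float32)
        channels.append(min_max_scale(resize_nearest(spec, out_shape)))
    return np.stack(channels, axis=-1)


# Random stand-ins for real spectrograms (frequency bins x time frames).
rng = np.random.default_rng(0)
mel = rng.random((128, 300))
cqt = rng.random((84, 300))
gammatone = rng.random((64, 298))

fused = fuse_channels(mel, cqt, gammatone)
print(fused.shape)  # (224, 224, 3)
```

Scaling each channel separately keeps one representation from dominating the others when their dynamic ranges differ, which is one plausible reason to normalize before stacking.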

## Abstract

Early, non-invasive detection of Parkinson’s Disease (PD) using speech analysis offers promise for scalable screening. In this work, we propose a multi-channel spectro-temporal deep-learning approach for PD detection from sentence-level speech, a clinically relevant yet underexplored modality. We extract three complementary time–frequency representations: mel spectrogram, constant-Q transform (CQT), and gammatone spectrogram. These are fused into a three-channel input analogous to an RGB image. This fused representation is evaluated across CNNs (ResNet, DenseNet, and EfficientNet) and a Vision Transformer on the PC-GITA dataset, under 10-fold subject-independent cross-validation for robust assessment. Results show that fusion consistently improves performance over single representations across architectures. EfficientNet-B2 achieves the highest accuracy (84.39% ± 5.19%) and F1-score (84.35% ± 5.52%), outperforming recent methods using handcrafted features or pretrained models (e.g., Wav2Vec2.0, HuBERT) on the same task and dataset. Performance varies with sentence type, with emotionally salient and prosodically emphasized utterances yielding higher AUC, suggesting that richer prosody enhances discriminability. Our findings indicate that multi-channel fusion enhances sensitivity to subtle speech impairments in PD by integrating complementary spectral information. The approach could therefore improve the detection of discriminative acoustic biomarkers and potentially offer a more robust and effective framework for speech-based PD screening, though further validation is needed before clinical application.
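
Subject-independent cross-validation, as mentioned in the abstract, means that all recordings from one speaker fall on the same side of every train/test split. The sketch below shows one simple way to build such folds with the standard library only. The function name `subject_independent_folds`, the round-robin assignment of speakers to folds, and the toy speaker IDs are assumptions for illustration. It does not balance PD and control speakers within folds, and the authors' actual fold construction may differ.

```python
import random
from collections import defaultdict


def subject_independent_folds(speaker_ids, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs in which no speaker is on both sides."""
    # Group utterance indices by the speaker who produced them.
    by_speaker = defaultdict(list)
    for idx, speaker in enumerate(speaker_ids):
        by_speaker[speaker].append(idx)

    # Shuffle speakers (not utterances) so whole speakers move between folds.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    for k in range(n_folds):
        # Round-robin assignment: every n_folds-th speaker is held out in fold k.
        test_speakers = set(speakers[k::n_folds])
        test_idx = [i for s in speakers if s in test_speakers for i in by_speaker[s]]
        train_idx = [i for s in speakers if s not in test_speakers for i in by_speaker[s]]
        yield train_idx, test_idx


# Toy example: 20 hypothetical speakers with 3 utterances each.
speaker_ids = [f"spk{n:02d}" for n in range(20) for _ in range(3)]
folds = list(subject_independent_folds(speaker_ids, n_folds=10))

for train_idx, test_idx in folds:
    train_speakers = {speaker_ids[i] for i in train_idx}
    test_speakers = {speaker_ids[i] for i in test_idx}
    assert not (train_speakers & test_speakers)
```

Splitting by utterance instead of by speaker would let the model see the same voice in training and testing, which tends to inflate reported accuracy.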

## Linked entities

- **Diseases:** Parkinson’s Disease (MONDO:0005180)

## Full-text entities

- **Diseases:** PD (MESH:D010300), speech impairments (MESH:D013064)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12565443/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12565443/full.md

## References

59 references — full list in the complete paper: https://tomesphere.com/paper/PMC12565443/full.md

---
Source: https://tomesphere.com/paper/PMC12565443