MoTAS: MoE-Guided Feature Selection from TTS-Augmented Speech for Enhanced Multimodal Alzheimer's Early Screening

Yongqi Shao; Binxin Mei; Cong Tan; Hong Huo; Tao Fang

arXiv:2508.20513·cs.SD·August 29, 2025

Open Access

TL;DR

MoTAS combines TTS-based data augmentation with MoE-based feature selection to improve early Alzheimer's screening from speech, reaching 85.71% accuracy on the ADReSSo test set despite limited training data.

Contribution

The paper presents MoTAS, a new method integrating TTS augmentation and MoE for adaptive feature selection in multimodal speech analysis for Alzheimer's detection.

Findings

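01. Achieves 85.71% accuracy on the ADReSSo test set, averaged over five runs.

02. Outperforms the compared methods on accuracy; the strongest prior results in the comparison, TDNN-ASR-M5 and Whisper-TL-FTP Medium, each reach 84.51%.

03. Ablation studies support both components: accuracy is 78.28% without TTS augmentation or MoE and 85.71% with both.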

Abstract

Early screening for Alzheimer's Disease (AD) through speech presents a promising non-invasive approach. However, challenges such as limited data and the lack of fine-grained, adaptive feature selection often hinder performance. To address these issues, we propose MoTAS, a robust framework designed to enhance AD screening efficiency. MoTAS leverages Text-to-Speech (TTS) augmentation to increase data volume and employs a Mixture of Experts (MoE) mechanism to improve multimodal feature selection, jointly enhancing model generalization. The process begins with automatic speech recognition (ASR) to obtain accurate transcriptions. TTS is then used to synthesize speech that enriches the dataset. After extracting acoustic and text embeddings, the MoE mechanism dynamically selects the most informative features, optimizing feature fusion for improved classification. Evaluated on the ADReSSo…

Tables

Table 1. Dataset Splits Before and After 3× Augmentation

Class	Train	Train_aug	Test
AD	87	253	35
CN	79	228	36
Total	166	481	71

Table 2. Comparison of Our Method With Existing Approaches on the ADReSSo Test Set. Metrics Include Accuracy, Precision, Recall, and F1-Score, Following Definitions From the Baseline Study (Luz et al., 2021). Our Method’s Results Are Averaged Over Five Runs.

Method	Accuracy (%)	Precision AD (%)	Precision CN (%)	Recall AD (%)	Recall CN (%)	F1 AD (%)	F1 CN (%)
Without ASR Transcripts
ADReSSo Baseline (eGeMAPS+SVM) (Luz et al., 2021)	64.79	-	-	-	-	-	-
Wav2Vec2+TB (Pan et al., 2021)	74.65	77.42	72.50	68.57	80.56	72.73	76.32
Whisper-TL medium (Li and Zhang, 2024)	77.46	77.14	77.78	77.14	77.78	77.14	77.78
With ASR Transcripts
ADReSSo Baseline (Late Fusion) (Luz et al., 2021)	78.87	77.78	80.00	80.00	77.78	78.87	78.87
WavBERT M_b (W2V2 ASR + BERT) (Zhu et al., 2021)	73.24	75.00	71.79	68.57	77.78	71.64	74.67
C-Attention-Unified (Wang et al., 2021)	78.03	74.15	84.12	87.22	68.57	80.09	75.42
WavBERT M_p (W2V2 ASR + BERT + Pauses) (Zhu et al., 2021)	83.10	87.10	80.00	77.14	88.89	81.82	84.21
TDNN-ASR-M5 (Pan et al., 2021)	84.51	81.58	87.88	88.57	80.56	84.93	84.06
Whisper-TL-FTP Medium (Li and Zhang, 2024)	84.51	83.33	85.71	85.71	83.33	84.50	84.50
MoTAS (Ours)	85.71	80.49	93.10	94.29	77.14	86.84	84.38
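
The F1 columns in Table 2 can be sanity-checked against the precision and recall columns, since F1 is the harmonic mean of the two. The short sketch below recomputes F1 for the MoTAS row. Because the reported values are averages over five runs, the recomputed numbers are only expected to agree approximately; the helper name `f1_from_percent` is introduced here for illustration.

```python
def f1_from_percent(precision_pct: float, recall_pct: float) -> float:
    """Harmonic mean of precision and recall, both given in percent."""
    p = precision_pct / 100.0
    r = recall_pct / 100.0
    return 100.0 * 2.0 * p * r / (p + r)


# MoTAS row of Table 2: (precision, recall, reported F1) per class.
motas_row = {
    "AD": (80.49, 94.29, 86.84),
    "CN": (93.10, 77.14, 84.38),
}

for label, (precision, recall, reported_f1) in motas_row.items():
    recomputed = f1_from_percent(precision, recall)
    gap = abs(recomputed - reported_f1)
    print(f"{label}: recomputed F1 = {recomputed:.2f}, reported = {reported_f1:.2f}, gap = {gap:.3f}")
```

Both gaps stay within a few hundredths of a percentage point, which is consistent with rounding and with averaging per-run metrics.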

Table 3. Ablation Study on TTS Augmentation and MoE. We Evaluated Dataset Expansion at 1.5×, 2×, 2.5×, and 3×, Followed by MoE Ablation on Both the Original and Best-Performing Augmented Datasets. Results Are Averaged Over Five Runs on the ADReSSo Test Set (Luz et al., 2021).

ID	TTS (times)	MoE	Accuracy (%)	Precision AD (%)	Precision CN (%)	Recall AD (%)	Recall CN (%)	F1 AD (%)	F1 CN (%)
1	X	X	78.28	81.46	76.06	73.71	82.86	77.20	79.20
2	X	✓	79.71	82.58	77.98	76.00	83.43	78.81	80.40
3	✓(2)	X	81.72	78.30	86.75	88.00	75.43	82.74	80.48
4	✓(2)	✓	85.71	80.49	93.10	94.29	77.14	86.84	84.38
5	✓(1.5)	✓	81.72	80.76	82.99	83.43	80.00	82.02	81.38
6	✓(2.5)	✓	82.86	79.29	87.73	89.14	76.57	83.87	81.68
7	✓(3)	✓	80.29	80.65	82.40	81.72	78.86	80.43	79.75

Equations

T = f_{ASR}(S), \quad \text{where } f_{ASR} = \text{Whisper}

\hat{s}_{i}^{AD} = f_{TTS}(t_{j}^{AD}, s_{i}^{AD}), \quad \hat{s}_{i}^{CN} = f_{TTS}(t_{j}^{CN}, s_{i}^{CN})

S_{aug} = S_{orig} \cup \{\hat{s}_{i}^{AD}, \hat{s}_{i}^{CN}\}

T_{aug} = f_{ASR}(S_{aug}), \quad \text{where } f_{ASR} = \text{Whisper}

Data_{aug} = \{S_{aug}, T_{aug}\}
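
The augmentation equations above can be read as a small data pipeline: transcribe, synthesize new speech within each class, merge, and transcribe again. The following is a minimal sketch of that flow, not the authors' implementation. The callables `asr` and `tts` are hypothetical stand-ins for Whisper and the TTS model, and the way a transcript t_j is paired with a reference recording s_i (independent random draws within the same class) is an assumption, since the pairing rule is not given here.

```python
import random
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, str]  # (speech identifier, class label: "AD" or "CN")


def augment(
    s_orig: List[Sample],
    asr: Callable[[str], str],
    tts: Callable[[str, str], str],
    n_new_per_class: int,
    seed: int = 0,
) -> Tuple[List[Sample], Dict[str, str]]:
    """Build S_aug and T_aug from S_orig, mirroring the equations above."""
    rng = random.Random(seed)
    # T = f_ASR(S): transcribe each original recording.
    transcripts = {speech: asr(speech) for speech, _ in s_orig}
    s_aug = list(s_orig)
    for label in ("AD", "CN"):
        pool = [speech for speech, lab in s_orig if lab == label]
        for _ in range(n_new_per_class):
            # Assumption: t_j and s_i are drawn independently within a class.
            t_j = transcripts[rng.choice(pool)]
            s_i = rng.choice(pool)
            # s_hat = f_TTS(t_j, s_i); the new sample inherits the class label.
            s_aug.append((tts(t_j, s_i), label))
    # T_aug = f_ASR(S_aug): re-transcribe the enlarged set.
    t_aug = {speech: asr(speech) for speech, _ in s_aug}
    return s_aug, t_aug


# Hypothetical stand-ins so the sketch runs without any speech models.
def fake_asr(speech: str) -> str:
    return f"transcript of {speech}"


def fake_tts(text: str, reference_speech: str) -> str:
    return f"synth[{reference_speech} | {text}]"


s_orig = [("ad_01", "AD"), ("ad_02", "AD"), ("cn_01", "CN"), ("cn_02", "CN")]
s_aug, t_aug = augment(s_orig, fake_asr, fake_tts, n_new_per_class=3)
print(len(s_orig), "->", len(s_aug))
```

Synthesizing only from same-class text and speech keeps the labels of the new samples unambiguous, which is what the class-indexed superscripts in the equations suggest.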

X = \{x_{w}, x_{m}, x_{s}, x_{t}\} = \{f_{W2V2}(s_{i}), f_{MFCC}(s_{i}), f_{Spec}(s_{i}), f_{Text}(t_{i})\}

X_{MoE} = \{x_{m}, x_{s}, x_{t}\} = \{f_{MFCC}(s_{i}), f_{Spec}(s_{i}), f_{Text}(t_{i})\}

y_{i} = E_{i}(x), \quad i \in \{1, 2, \dots, k\}

w = G(x) = \mathrm{softmax}(W_{g} x + b_{g})

x_{MoE}^{mfcc}

x_{MoE}^{spec}

x_{MoE}^{text}

x_{final} = \mathrm{concat}(x_{MoE}^{mfcc}, x_{MoE}^{spec}, x_{MoE}^{text}, x_{w})
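
The gating and expert equations above can be illustrated with a small pure-Python sketch. Two points are assumptions rather than details taken from the equations: the expert outputs are combined as the dense weighted sum \sum_{i} w_{i} y_{i} (the usual MoE formulation), and each expert is a plain linear map. For brevity the demo also reuses one set of hypothetical experts and gate parameters for all three modalities.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Linear = Tuple[Matrix, Vector]  # (weights, bias)


def affine(W: Matrix, x: Vector, b: Vector) -> Vector:
    """Compute W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def softmax(z: Vector) -> Vector:
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x: Vector, experts: List[Linear], W_g: Matrix, b_g: Vector):
    """Return (weighted expert output, gate weights) for one input vector."""
    w = softmax(affine(W_g, x, b_g))             # w = G(x)
    ys = [affine(W, x, b) for W, b in experts]   # y_i = E_i(x)
    out_dim = len(ys[0])
    # Assumption: dense combination sum_i w_i * y_i.
    out = [sum(w_i * y[d] for w_i, y in zip(w, ys)) for d in range(out_dim)]
    return out, w


# Hypothetical parameters: an identity expert, a doubling expert, a uniform gate.
experts = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0]),
]
W_g = [[0.0, 0.0], [0.0, 0.0]]
b_g = [0.0, 0.0]

features = {"mfcc": [1.0, 2.0], "spec": [0.0, 1.0], "text": [2.0, 0.0]}
x_w = [0.5, 0.5]  # Wav2Vec2 embedding, which bypasses the MoE

# x_final = concat(x_MoE^mfcc, x_MoE^spec, x_MoE^text, x_w)
x_final: Vector = []
for name in ("mfcc", "spec", "text"):
    out, gate = moe_forward(features[name], experts, W_g, b_g)
    x_final += out
x_final += x_w
print(x_final)
```

With a uniform gate each output is the average of the two experts, so the gate weights directly control how much each expert's view of a modality reaches the fused vector.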

f_{MLP}(x) = FC_{3}(\mathrm{ReLU}(\mathrm{Dropout}(FC_{2}(\mathrm{ReLU}(FC_{1}(x))))))

y = f_{MLP}(x_{final})
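
The classifier equation above fixes the order of operations: FC_1, ReLU, FC_2, Dropout, ReLU, FC_3. A minimal pure-Python forward pass in that order is sketched below. Layer sizes, weights, and the dropout rate are hypothetical, and dropout is implemented as inverted dropout that becomes the identity at inference, which is a common convention rather than something stated in the equations.

```python
import random
from typing import List, Optional, Tuple

Vector = List[float]
Layer = Tuple[List[List[float]], List[float]]  # (weights, bias)


def fc(layer: Layer, x: Vector) -> Vector:
    """Fully connected layer: W x + b."""
    W, b = layer
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def dropout(x: Vector, p: float, training: bool, rng: random.Random) -> Vector:
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(x)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def f_mlp(
    x: Vector,
    fc1: Layer,
    fc2: Layer,
    fc3: Layer,
    p: float = 0.5,
    training: bool = False,
    rng: Optional[random.Random] = None,
) -> Vector:
    rng = rng or random.Random(0)
    h = relu(fc(fc1, x))                              # ReLU(FC_1(x))
    h = relu(dropout(fc(fc2, h), p, training, rng))   # ReLU(Dropout(FC_2(.)))
    return fc(fc3, h)                                 # FC_3(.)


# Hypothetical tiny layers: 2 -> 2 -> 1 -> 1.
fc1 = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
fc2 = ([[1.0, -1.0]], [0.0])
fc3 = ([[2.0]], [1.0])

print(f_mlp([3.0, 1.0], fc1, fc2, fc3))
print(f_mlp([1.0, 3.0], fc1, fc2, fc3))
```

Note that the equation ends at FC_3, so the output here is an unnormalized score; any final squashing to a probability would be a separate step.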

L = -\sum_{i=1}^{N} \left[ y_{i} \log(\hat{y}_{i}) + (1 - y_{i}) \log(1 - \hat{y}_{i}) \right]
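
The loss above is the binary cross-entropy summed over N samples. As a small worked sketch, the function below evaluates it for a list of labels and predicted probabilities. The clamp `eps` is an addition for numerical safety and is not part of the equation, and the function follows the equation's plain sum rather than a mean over samples.

```python
import math
from typing import Sequence


def bce_loss(y_true: Sequence[float], y_prob: Sequence[float], eps: float = 1e-12) -> float:
    """Summed binary cross-entropy: -sum[y log(p) + (1 - y) log(1 - p)]."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Assumption: clamp p away from 0 and 1 so log() stays finite.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return -total


# Two maximally uncertain predictions contribute log(2) each.
uncertain = bce_loss([1, 0], [0.5, 0.5])
# Confident, correct predictions give a much smaller loss.
confident = bce_loss([1, 0], [0.9, 0.1])
print(uncertain, confident)
```

Here `y_prob` is assumed to already lie in (0, 1), for example after a sigmoid applied to the classifier score.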

Taxonomy

Topics: Mental Health via Writing · Topic Modeling

Full text

MoTAS: MoE-Guided Feature Selection from TTS-Augmented Speech for Enhanced Multimodal Alzheimer’s Early Screening

Yongqi Shao