MoTAS: MoE-Guided Feature Selection from TTS-Augmented Speech for Enhanced Multimodal Alzheimer's Early Screening
Yongqi Shao, Bingxin Mei, Cong Tan, Hong Huo, Tao Fang

TL;DR
MoTAS introduces a novel framework combining TTS data augmentation and MoE-based feature selection to improve early Alzheimer's screening from speech, achieving state-of-the-art accuracy in limited data scenarios.
Contribution
The paper presents MoTAS, a new method integrating TTS augmentation and MoE for adaptive feature selection in multimodal speech analysis for Alzheimer's detection.
Findings
Achieves 85.71% accuracy on ADReSSo dataset.
Outperforms existing baseline methods.
Validates effectiveness of TTS and MoE components through ablation studies.
| Class | Train | Train_aug | Test |
|---|---|---|---|
| AD | 87 | 253 | 35 |
| CN | 79 | 228 | 36 |
| Total | 166 | 481 | 71 |
| Method | Accuracy (%) | Precision AD (%) | Precision CN (%) | Recall AD (%) | Recall CN (%) | F1 Score AD (%) | F1 Score CN (%) |
|---|---|---|---|---|---|---|---|
| Without ASR Transcripts | | | | | | | |
| ADReSSo Baseline (eGeMAPS+SVM) (Luz et al., 2021) | 64.79 | - | - | - | - | - | - |
| Wav2Vec2+TB (Pan et al., 2021) | 74.65 | 77.42 | 72.50 | 68.57 | 80.56 | 72.73 | 76.32 |
| Whisper-TL medium (Li and Zhang, 2024) | 77.46 | 77.14 | 77.78 | 77.14 | 77.78 | 77.14 | 77.78 |
| With ASR Transcripts | | | | | | | |
| ADReSSo Baseline (Late Fusion) (Luz et al., 2021) | 78.87 | 77.78 | 80.00 | 80.00 | 77.78 | 78.87 | 78.87 |
| WavBERT Mb (W2V2 ASR + BERT) (Zhu et al., 2021) | 73.24 | 75.00 | 71.79 | 68.57 | 77.78 | 71.64 | 74.67 |
| C-Attention-Unified (Wang et al., 2021) | 78.03 | 74.15 | 84.12 | 87.22 | 68.57 | 80.09 | 75.42 |
| WavBERT Mp (W2V2 ASR + BERT + Pauses) (Zhu et al., 2021) | 83.10 | 87.10 | 80.00 | 77.14 | 88.89 | 81.82 | 84.21 |
| TDNN-ASR-M5 (Pan et al., 2021) | 84.51 | 81.58 | 87.88 | 88.57 | 80.56 | 84.93 | 84.06 |
| Whisper-TL-FTP Medium (Li and Zhang, 2024) | 84.51 | 83.33 | 85.71 | 85.71 | 83.33 | 84.50 | 84.50 |
| MoTAS (Ours) | 85.71 | 80.49 | 93.10 | 94.29 | 77.14 | 86.84 | 84.38 |
| ID | TTS (times) | MoE | Accuracy (%) | Precision AD (%) | Precision CN (%) | Recall AD (%) | Recall CN (%) | F1 Score AD (%) | F1 Score CN (%) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | X | X | 78.28 | 81.46 | 76.06 | 73.71 | 82.86 | 77.20 | 79.20 |
| 2 | X | ✓ | 79.71 | 82.58 | 77.98 | 76.00 | 83.43 | 78.81 | 80.40 |
| 3 | ✓(2) | X | 81.72 | 78.30 | 86.75 | 88.00 | 75.43 | 82.74 | 80.48 |
| 4 | ✓(2) | ✓ | 85.71 | 80.49 | 93.10 | 94.29 | 77.14 | 86.84 | 84.38 |
| 5 | ✓(1.5) | ✓ | 81.72 | 80.76 | 82.99 | 83.43 | 80.00 | 82.02 | 81.38 |
| 6 | ✓(2.5) | ✓ | 82.86 | 79.29 | 87.73 | 89.14 | 76.57 | 83.87 | 81.68 |
| 7 | ✓(3) | ✓ | 80.29 | 80.65 | 82.40 | 81.72 | 78.86 | 80.43 | 79.75 |
MoTAS: MoE-Guided Feature Selection from TTS-Augmented Speech for Enhanced Multimodal Alzheimer’s Early Screening
Yongqi Shao, Bingxin Mei, Cong Tan, Hong Huo, and Tao Fang
Shanghai Jiao Tong University, Shanghai, China
(2025)
Abstract.
Early screening for Alzheimer’s Disease (AD) through speech presents a promising non-invasive approach. However, challenges such as limited data and the lack of fine-grained, adaptive feature selection often hinder performance. To address these issues, we propose MoTAS, a robust framework designed to enhance AD screening efficiency. MoTAS leverages Text-to-Speech (TTS) augmentation to increase data volume and employs a Mixture of Experts (MoE) mechanism to improve multimodal feature selection, jointly enhancing model generalization. The process begins with automatic speech recognition (ASR) to obtain accurate transcriptions. TTS is then used to synthesize speech that enriches the dataset. After extracting acoustic and text embeddings, the MoE mechanism dynamically selects the most informative features, optimizing feature fusion for improved classification. Evaluated on the ADReSSo dataset, MoTAS achieves a leading accuracy of 85.71%, outperforming existing baselines. Ablation studies further validate the individual contributions of TTS augmentation and MoE in boosting classification performance. These findings highlight the practical value of MoTAS in real-world AD screening scenarios, particularly in data-limited settings.
Keywords: Alzheimer’s Disease (AD), Speech-Based AD Screening, Text-to-Speech (TTS) Augmentation, Mixture of Experts (MoE)
Published in: Proceedings of the 33rd ACM International Conference on Multimedia, October 27–31, 2025, Dublin, Ireland. DOI: 10.1145/3746027.3755536. ISBN: 979-8-4007-2035-2/2025/10. CCS concepts: Applied computing, Health informatics.
1. Introduction
Alzheimer’s disease is a progressive neurodegenerative disorder that primarily affects cognitive functions, memory, and language abilities. As the most common cause of dementia, its prevalence is rising sharply, with an estimated 55 million people affected globally. This number is projected to reach 139 million by 2050 due to aging populations (Jeon et al., 2024).
The growing burden of AD poses significant challenges to healthcare systems, leading to escalating care costs, economic strain, and profound social impacts. Despite extensive research into potential treatments, no cure currently exists. This underscores the critical importance of early diagnosis in slowing disease progression and improving patient outcomes.
Traditional methods for diagnosing AD include clinical assessments, neuroimaging techniques such as magnetic resonance imaging (MRI), positron emission tomography (PET) scans, and cerebrospinal fluid (CSF) biomarker analysis. While these methods provide valuable insights into disease progression, they are expensive, require specialized resources, and are unsuitable for large-scale early screening (Passeri et al., 2022; Jha and Mukhopadhaya, 2020). CSF testing also involves invasive lumbar puncture, causing discomfort and reducing compliance (Bharati et al., 2022; McGeer et al., 1986). Moreover, such tools often detect AD only at later stages, limiting the effectiveness of potential interventions. These limitations underscore the urgent need for a non-invasive, cost-effective, and scalable early detection approach.
Recent studies indicate that speech-based analysis is a promising alternative for AD detection, as language impairments often appear in the early stages. AD patients typically show speech features such as pauses, hesitations, reduced fluency, pronunciation errors, and lexical deficits (Yang et al., 2022). With machine learning and deep learning models, speech analysis can automatically and objectively capture subtle linguistic and acoustic patterns linked to cognitive decline (Yang et al., 2022; Luz et al., 2018). Compared to conventional diagnostics, it offers a more practical solution for early screening and timely intervention.
However, despite its potential, existing speech-based AD detection methods still face several challenges (Ding et al., 2024). The limited availability of datasets constrains the generalization ability of deep learning models across diverse populations, making them prone to overfitting. Additionally, many models treat all features equally without adaptive selection, limiting their ability to capture fine-grained cues like speech rhythm and articulation errors. Moreover, state-of-the-art deep learning approaches often require substantial computational resources, limiting their practicality for real-time clinical applications.
To address these limitations, we propose MoTAS, a speech-based Alzheimer’s screening framework that leverages TTS-augmented speech and MoE-guided feature selection. Figure 1 illustrates the MoTAS pipeline. Our key contributions include:
- We propose a TTS data augmentation strategy that synthesizes speech associated with both AD and cognitively normal (CN) control groups, aiming to mitigate data scarcity and enhance model generalization.
- A MoE-guided feature selection mechanism is introduced to adaptively select features from acoustic and linguistic modalities, thereby optimizing feature utilization and reducing redundancy.
- Extensive experiments on the ADReSSo dataset demonstrate that our proposed framework significantly outperforms existing speech-based methods, achieving an accuracy of 85.71%.
By addressing challenges related to data availability, feature selection efficiency, and computational constraints, our approach advances automated, non-invasive, and scalable AD screening, providing a practical solution for early detection in real-world applications.
2. Related Work
In this section, we first introduce the key advances in TTS technology for data augmentation, then discuss the role of MoE in adaptive feature selection, and finally review existing speech-based AD detection methods.
2.1. TTS for Data Augmentation in Speech Processing
In recent years, TTS technology has made significant advancements, transitioning from concatenative and parametric models to deep learning-driven approaches. Modern TTS models, such as Fish-Speech (Liao et al., 2024), VITS (Kim et al., 2021), WaveNet (Van Den Oord et al., 2016), and Tacotron (Wang et al., 2017), have greatly enhanced the naturalness and intelligibility of synthesized speech. These models utilize sequence-to-sequence architectures and advanced waveform generation techniques, enabling the production of high-quality speech with human-like prosody and articulation. As a result, TTS has been widely adopted in assistive technologies, virtual assistants, and speech synthesis research.
TTS for data augmentation has proven effective in speech-related tasks, particularly in addressing data scarcity. It has been applied in ASR (Yang et al., 2025; Do et al., 2024), speech emotion recognition (SER) (Latif et al., 2023; Praseetha and Joby, 2022), and accent adaptation (Do et al., 2024; Tan et al., 2021), significantly improving model robustness and performance. However, its potential remains largely unexplored in AD detection. Current AD speech datasets are often small and imbalanced, making models prone to overfitting and limiting their generalizability. In this study, we leverage TTS-augmented synthetic AD speech to expand the dataset, preserving critical linguistic and acoustic features associated with cognitive decline. This enhancement boosts classification accuracy and improves the reliability of speech-based AD screening.
2.2. MoE-Guided Adaptive Feature Selection
Mixture of Experts (MoE) is a deep learning approach that improves model efficiency by dynamically selecting specialized subnetworks (experts) for different input types. It has been widely used in NLP, speech, and vision tasks to enhance scalability in large-scale learning. Recent models like Google’s Switch Transformer (Fedus et al., 2022) and DeepSeek-V3 (Liu et al., 2024a) show that MoE can reduce computation while preserving high capacity.
In speech-related tasks, MoE has been employed to optimize feature selection and processing, leading to improved model generalization. Research has shown its effectiveness in ASR (Hsu et al., 2021), speaker verification (Wang et al., 2025; Gaur et al., 2021), and SER (Hyeon et al., 2024; Liu et al., 2024b; Salman et al., 2025), where it efficiently allocates different experts to process prosodic, phonetic, and spectral features, outperforming conventional deep learning approaches. Despite its success in various speech tasks, MoE has been underutilized in speech-based AD detection. While existing AD classification models often process acoustic and linguistic features separately, they typically lack mechanisms to adaptively prioritize the most informative cues within each modality, such as speech rhythm, articulation errors, and lexical patterns. Furthermore, many methods treat all input features equally, which may lead to suboptimal learning and inefficient computation.
To address these limitations, we propose an MoE-guided mechanism that adaptively selects expert networks based on feature types, enabling dynamic focus on salient linguistic and paralinguistic cues. This hierarchical design enhances AD detection accuracy while reducing redundancy and computational cost.
2.3. Speech-Based AD Detection Methods
Existing speech-based AD detection methods primarily extract acoustic and linguistic features from spontaneous speech recordings and their transcriptions (Mahajan and Baths, 2021; Cui et al., 2021; Luz et al., 2021). Traditional approaches rely on handcrafted features, such as speech rate, pause duration, pitch variation, and Mel-Frequency Cepstral Coefficients (MFCCs), which are then classified using Support Vector Machines (SVMs) and Random Forests (Luz et al., 2018; Balagopalan and Novikova, 2021; Chen et al., 2021). Additionally, linguistic features derived from text transcriptions, such as lexical diversity, syntactic complexity, and word repetition patterns, have been explored to detect early cognitive decline (Eyigoz et al., 2020; Fraser et al., 2015).
With the rise of deep learning, feature extraction has shifted from manual engineering to data-driven learning, significantly improving model performance. CNNs and RNNs have been successfully applied to Mel spectrograms and raw audio waveforms, capturing complex temporal and spectral variations (Gupta et al., 2021). Meanwhile, self-supervised learning models such as Wav2Vec2 (Gauder et al., 2021) enable feature extraction directly from raw speech, eliminating the need for manual feature engineering. In text analysis, Transformer-based models such as BERT (Devlin et al., 2019) and DistilBERT (Sanh et al., 2019) have been widely adopted to analyze transcribed speech, learning semantic coherence and syntactic changes associated with AD (Mirheidari et al., 2021; Pan et al., 2021). Furthermore, multimodal models integrating speech and text features have shown superior classification performance by leveraging complementary acoustic and linguistic markers (Wang et al., 2021; Zhu et al., 2021; Li and Zhang, 2024).
Despite these advancements, existing methods still face critical challenges, including limited dataset availability, suboptimal feature fusion, and high computational costs. To address these issues, we propose MoTAS, a multimodal framework that increases dataset size using TTS and enhances feature selection via a MoE mechanism. By combining synthetic speech generation with adaptive, fine-grained multimodal feature selection, our approach enhances the robustness, accuracy, and scalability of automated AD detection, making it more practical for real-world deployment.
3. Methodology
The overall framework of our method is illustrated in Figure 2. Raw speech is transcribed using ASR and augmented with TTS. Both real and synthetic speech, along with their transcriptions, are encoded into multimodal features. These features are then refined through a MoE mechanism, fused, and finally used to classify each sample as AD or CN.
This section outlines the proposed MoTAS framework, concentrating on its key components: TTS for data augmentation, the acoustic and text encoders, MoE for feature selection, and feature fusion and classification.
3.1. TTS for Data Augmentation
This study employs TTS to generate synthetic speech samples, thereby expanding the dataset while preserving disease-relevant acoustic characteristics. Since the original dataset consists solely of raw English speech recordings, we first apply ASR using Whisper (Radford et al., 2023) to obtain text transcriptions, creating paired audio-text data. The process is formulated as follows:
$$T = \mathrm{ASR}(S), \qquad S = S_{AD} \cup S_{CN}, \qquad T = T_{AD} \cup T_{CN}$$
Here, $S$ and $T$ represent the sets of raw speech samples and their corresponding text transcriptions, respectively. Let $S_{AD} = \{s_i^{AD}\}_{i=1}^{N_{AD}}$ and $S_{CN} = \{s_i^{CN}\}_{i=1}^{N_{CN}}$ represent the sets of AD and CN speech samples, respectively; $T_{AD} = \{t_i^{AD}\}_{i=1}^{N_{AD}}$ and $T_{CN} = \{t_i^{CN}\}_{i=1}^{N_{CN}}$ denote the corresponding ASR-transcribed texts for AD and CN speech samples, where $N_{AD}$ and $N_{CN}$ are the numbers of AD and CN samples in the original dataset.
Once transcriptions are obtained, a pre-trained TTS model is used to synthesize new speech samples. The generated synthetic speech retains the acoustic characteristics of the reference speaker while replacing the linguistic content with transcriptions from another speaker. The synthesis process is defined as:
$$\hat{s}_{i,j}^{AD} = \mathrm{TTS}(s_i^{AD}, t_j^{AD}), \qquad \hat{s}_{i,j}^{CN} = \mathrm{TTS}(s_i^{CN}, t_j^{CN})$$
where $s_i^{AD} \in S_{AD}$, $t_j^{AD} \in T_{AD}$, and $i \neq j$ for AD samples, and similarly $s_i^{CN} \in S_{CN}$, $t_j^{CN} \in T_{CN}$, and $i \neq j$ for CN samples. Here, $s_i$ provides the speaker identity and $t_j$ provides the linguistic content. The TTS model synthesizes speech that combines the voice characteristics of $s_i$ with the transcript $t_j$.
To augment the dataset further, each speech sample $s_i$ can be paired with multiple transcriptions $t_j$ from the same class (where $j \neq i$). This intra-class pairing allows a single reference voice to be combined with various linguistic contents, producing a rich set of synthetic samples with consistent speaker identity and class label. Such augmentation improves data diversity while preserving class-specific characteristics.
In this study, we employ Fish-Speech (Liao et al., 2024), a state-of-the-art TTS model designed for high-fidelity speaker-preserving synthesis, so that the generated speech retains the reference speaker’s prosody, rhythm, and articulation. We also balance the proportion of real and synthetic samples during training to prevent overfitting.
Although the speech and text come from different speakers, the TTS model preserves the key acoustic features of the reference speaker. This is crucial for ensuring that the synthetic speech reflects disease-related vocal cues and remains a reliable addition to the dataset.
The final augmented speech is defined as:
$$\hat{S} = \hat{S}_{AD} \cup \hat{S}_{CN}$$
where $\hat{S}_{AD}$ and $\hat{S}_{CN}$ denote the sets of synthesized AD and CN speech samples.
To prevent semantic redundancy, we rely on the fact that even when reusing textual content across speakers, the acoustic expression remains speaker-dependent due to variations in prosody and articulation. As a result, the ASR-transcribed text from real participants’ speech naturally reflects speaker-specific disfluencies or omissions, introducing lexical variation that enhances diversity at both the acoustic and linguistic levels.
Thus, after generating the augmented speech samples, we perform a second round of ASR on the synthetic audio, with the resulting transcriptions serving as new textual inputs for subsequent processing. The augmented transcription process is defined as follows:
$$\hat{T} = \mathrm{ASR}(\hat{S})$$
Finally, the augmented dataset is denoted as:
$$\hat{D} = \{(\hat{s}, \hat{t}) \mid \hat{s} \in \hat{S},\ \hat{t} = \mathrm{ASR}(\hat{s})\}$$
This paired multimodal dataset, together with the original dataset, serves as the input for downstream tasks including feature extraction, selection, fusion, and classification in our framework.
Overall, the TTS-augmented speech mechanism enhances dataset diversity while maintaining data quality, allowing the model to generalize more effectively across different speech patterns and mitigating overfitting issues.
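The augmentation loop described in this section can be sketched as follows. This is a minimal illustration, not the authors' implementation: `build_intra_class_pairs`, `augment`, and the `synthesize` and `transcribe` callables are hypothetical names, and the two callables are stand-ins for a speaker-preserving TTS model and an ASR model. The sketch only shows the intra-class pairing (reference speaker i, transcript j, with j different from i) and the second ASR pass over the synthetic audio.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple


def build_intra_class_pairs(
    labels: Sequence[str], per_speaker: int, seed: int = 0
) -> List[Tuple[int, int]]:
    """Return (reference_index, transcript_index) pairs within the same class.

    Each recording i is paired with up to `per_speaker` transcripts j taken
    from other recordings that share its label (j != i).
    """
    rng = random.Random(seed)
    by_class: Dict[str, List[int]] = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    pairs: List[Tuple[int, int]] = []
    for members in by_class.values():
        for i in members:
            candidates = [j for j in members if j != i]
            for j in rng.sample(candidates, min(per_speaker, len(candidates))):
                pairs.append((i, j))
    return pairs


def augment(
    audio: Sequence[object],
    transcripts: Sequence[str],
    labels: Sequence[str],
    synthesize: Callable[[object, str], object],  # hypothetical TTS wrapper
    transcribe: Callable[[object], str],          # hypothetical ASR wrapper
    per_speaker: int = 2,
) -> List[Tuple[object, str, str]]:
    """Build (synthetic_audio, re-transcribed_text, label) triples."""
    augmented = []
    for i, j in build_intra_class_pairs(labels, per_speaker):
        synthetic_audio = synthesize(audio[i], transcripts[j])  # voice of i, words of j
        synthetic_text = transcribe(synthetic_audio)            # second ASR pass
        augmented.append((synthetic_audio, synthetic_text, labels[i]))
    return augmented


# Toy usage with string placeholders instead of real audio and models.
labels = ["AD", "AD", "AD", "CN", "CN"]
audio = [f"voice{i}" for i in range(5)]
transcripts = [f"text{i}" for i in range(5)]
samples = augment(
    audio,
    transcripts,
    labels,
    synthesize=lambda ref, text: f"{ref}|{text}",
    transcribe=lambda wav: wav.split("|")[1],
)
```

Because pairs are only drawn within a class, every synthetic sample inherits the label of its reference speaker.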
3.2. Acoustic and Text Encoder
For each speech sample $s$ and its corresponding text transcription $t$, we extract features from both acoustic and text modalities, forming the input feature set:
$$F = \{f_{w2v}, f_{mfcc}, f_{spec}, f_{text}\}$$
where $f_{w2v}$ represents deep phonetic features extracted using Wav2Vec2 (Baevski et al., 2020), capturing nuanced acoustic patterns that reflect speech prosody, phonetics, and articulation dynamics; $f_{mfcc}$ denotes MFCC-based temporal dynamics modeled by BiLSTM (Huang et al., 2015), reflecting the spectral envelope and short-term temporal structure; $f_{spec}$ represents spectrogram-based features extracted using ResNet18 (He et al., 2016), capturing the local time-frequency energy distribution and prosodic cues; and $f_{text}$ corresponds to semantic and syntactic embeddings obtained from BERT (Devlin et al., 2019), encoding the high-level semantic and syntactic patterns of speech.
These features comprehensively represent both the acoustic and linguistic aspects of the data, allowing MoE to selectively integrate the most relevant multimodal information.
3.3. MoE for Feature Selection
Building upon Section 3.2, we introduce the MoE mechanism to dynamically select the most informative multimodal features, optimizing classification performance. For this mechanism, we consider the following subset of features without the Wav2Vec2 component:
$$F' = \{f_{mfcc}, f_{spec}, f_{text}\}$$
where $f_{mfcc}$, $f_{spec}$, and $f_{text}$ are the MFCC, spectrogram, and text features. The Wav2Vec2 feature $f_{w2v}$ is excluded because it is extracted by a pre-trained model and already contains rich representational capabilities. Since it is primarily focused on representing acoustic features, further selection is not necessary. On the other hand, the text features extracted by BERT still contain valuable semantic information and are considered crucial for the task, thus they are retained in the feature selection phase. This distinction ensures that MoE can focus on integrating and enhancing the discrimination power of spectral, temporal, and semantic features.
The MoE mechanism consists of $K$ expert networks, each designed to capture different patterns from the input feature vector. As shown in Figure 2, the three types of features are all input into the same MoE mechanism, but the MoE process for each feature is performed independently. Each expert network in the MoE mechanism produces an output for its corresponding feature type, with $f_{mfcc}$, $f_{spec}$, and $f_{text}$ being input separately as follows:
$$e_k^{(m)} = E_k^{(m)}(f_m), \qquad k = 1, \dots, K, \qquad m \in \{mfcc, spec, text\}$$
where $E_k^{(m)}$ represents the $k$-th expert network for feature type $m$.
To dynamically control the contribution of each expert, a gating network is employed to generate a weight vector $g^{(m)} = (g_1^{(m)}, \dots, g_K^{(m)})$, where each element $g_k^{(m)}$ corresponds to the importance of expert $k$. For each feature type $m$, a separate gating network is used to compute the expert weights:
$$g^{(m)} = \mathrm{softmax}\left(W_g^{(m)} f_m + b_g^{(m)}\right)$$
where $W_g^{(m)} \in \mathbb{R}^{K \times d_m}$, $b_g^{(m)} \in \mathbb{R}^{K}$, and $d_m$ denotes the dimension of the corresponding input feature. The softmax function ensures that the weights are positive and sum to 1, effectively determining the importance of each expert for the given input.
The final outputs of the MoE mechanism are computed separately for each feature type, providing distinct outputs for MFCC, spectrogram, and text features:
$$\tilde{f}_m = \sum_{k=1}^{K} g_k^{(m)} e_k^{(m)}, \qquad m \in \{mfcc, spec, text\}$$
where $e_k^{(m)}$ are outputs from the respective expert networks for MFCC, spectrogram, and text features, and $g_k^{(m)}$ are the corresponding weights from the gating mechanism.
This independent processing ensures that the MoE mechanism effectively emphasizes the most relevant features for each type, enhancing the robustness and generalizability of the classification.
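As a rough illustration of the gated expert mixture described above, the NumPy sketch below builds one small MoE per feature type and applies it independently. It rests on several assumptions that the text does not confirm: experts are single linear layers, the gate is a linear layer followed by softmax, and the feature sizes (8, 12, 16) and output size (4) are arbitrary toy values. The class name `FeatureMoE` is hypothetical.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


class FeatureMoE:
    """K linear experts plus a softmax gate for a single feature type."""

    def __init__(self, in_dim: int, out_dim: int, num_experts: int = 3, seed: int = 0):
        rng = np.random.default_rng(seed)
        # One (out_dim x in_dim) weight matrix and bias per expert.
        self.expert_w = rng.normal(0.0, 0.1, size=(num_experts, out_dim, in_dim))
        self.expert_b = np.zeros((num_experts, out_dim))
        # Gate maps the input feature to one logit per expert.
        self.gate_w = rng.normal(0.0, 0.1, size=(num_experts, in_dim))
        self.gate_b = np.zeros(num_experts)

    def __call__(self, f: np.ndarray):
        gate = softmax(self.gate_w @ f + self.gate_b)    # shape (K,), sums to 1
        expert_out = self.expert_w @ f + self.expert_b   # shape (K, out_dim)
        return gate @ expert_out, gate                   # weighted sum over experts


# Toy features standing in for the MFCC, spectrogram, and text embeddings.
rng = np.random.default_rng(42)
features = {
    "mfcc": rng.normal(size=8),
    "spec": rng.normal(size=12),
    "text": rng.normal(size=16),
}
# Each feature type gets its own experts and its own gate.
moes = {
    name: FeatureMoE(in_dim=f.shape[0], out_dim=4, seed=i)
    for i, (name, f) in enumerate(features.items())
}
selected, gates = {}, {}
for name, f in features.items():
    selected[name], gates[name] = moes[name](f)
```

Inspecting `gates[name]` shows how much each expert contributes for a given input, which is where the adaptive selection takes place.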
3.4. Feature Fusion and Classification
Following MoE-guided feature selection, we further enhance multimodal fusion by incorporating deep speech embeddings extracted via Wav2Vec2 (Baevski et al., 2020). Unlike MFCC, spectrogram, and text features, which are compressed into unified representations through a modality-specific MoE mechanism, Wav2Vec2 embeddings are preserved in their raw or temporally-aggregated form. This approach leverages the pre-trained model’s capacity for phonetic-level representation learning, thus avoiding unnecessary transformations and maintaining the integrity of low-level acoustic details.
To achieve comprehensive fusion of these diverse representations, we concatenate the MoE-guided features with the Wav2Vec2 feature:
$$f_{fused} = \mathrm{Concat}\left(\tilde{f}_{mfcc}, \tilde{f}_{spec}, \tilde{f}_{text}, f_{w2v}\right)$$
where $\tilde{f}_{mfcc}$, $\tilde{f}_{spec}$, and $\tilde{f}_{text}$ are the MoE-guided MFCC, spectrogram, and text features, and $f_{w2v}$ represents the deep speech embeddings extracted by Wav2Vec2 in Section 3.2. The MoE-guided features provide high-level acoustic and linguistic information, while Wav2Vec2 captures phonetic and low-level speech characteristics. This dual-layered fusion strategy ensures that the final feature representation effectively integrates both high-level semantics and fine-grained acoustic details.
The multi-layer perceptron (MLP) classifier consists of three fully connected layers with ReLU activations and dropout for regularization, tailored to handle complex interactions among fused features and prevent overfitting, particularly in data-limited clinical settings:
$$\mathrm{MLP}(x) = W_3\,\phi\left(W_2\,\phi\left(W_1 x + b_1\right) + b_2\right) + b_3$$
where $\phi$ denotes a ReLU activation followed by dropout.
The fused multimodal representation $f_{fused}$ is subsequently passed through the MLP classifier to produce a binary classification outcome:
$$\hat{y} = \mathrm{sigmoid}\left(\mathrm{MLP}(f_{fused})\right)$$
where $\hat{y}$ denotes the classification result (AD vs. CN). The classifier is trained using a cross-entropy loss function to optimize accuracy and robustness:
$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$$
Here, $i$ represents the index of the $i$-th sample, $y_i$ its ground-truth label, and $N$ the number of training samples. By leveraging MoE for feature selection and Wav2Vec2 for deep speech embedding fusion, our approach achieves a balance between capturing discriminative multimodal information and phonetic details, thereby enhancing classification performance.
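The fusion and classification steps can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions rather than the authors' code: the hidden sizes (64, 32), the toy feature sizes, the placement of dropout after each hidden ReLU, and the single-logit output with a sigmoid are all assumptions, and `init_mlp`, `mlp_logit`, and `binary_cross_entropy` are hypothetical helper names.

```python
import numpy as np


def init_mlp(in_dim: int, hidden=(64, 32), seed: int = 0):
    """Create weights for three fully connected layers ending in one logit."""
    rng = np.random.default_rng(seed)
    dims = (in_dim, *hidden, 1)
    return [
        (rng.normal(0.0, 0.1, size=(out_d, in_d)), np.zeros(out_d))
        for in_d, out_d in zip(dims[:-1], dims[1:])
    ]


def mlp_logit(x: np.ndarray, params, drop_rate: float = 0.0, rng=None) -> float:
    """Forward pass: ReLU (and optional inverted dropout) after each hidden layer."""
    h = x
    for layer, (w, b) in enumerate(params):
        h = w @ h + b
        if layer < len(params) - 1:
            h = np.maximum(h, 0.0)
            if rng is not None and drop_rate > 0.0:
                keep = rng.random(h.shape) >= drop_rate
                h = h * keep / (1.0 - drop_rate)
    return float(h[0])


def binary_cross_entropy(logit: float, label: int) -> float:
    """Stable BCE on a logit: log(1 + exp(z)) - y * z."""
    return float(np.logaddexp(0.0, logit) - label * logit)


# Toy stand-ins for the three MoE-guided features and the Wav2Vec2 embedding.
rng = np.random.default_rng(1)
moe_mfcc, moe_spec, moe_text, w2v = (rng.normal(size=d) for d in (8, 8, 8, 16))

fused = np.concatenate([moe_mfcc, moe_spec, moe_text, w2v])  # fusion by concatenation
params = init_mlp(fused.shape[0])
logit = mlp_logit(fused, params)                 # dropout disabled, as at inference
prob_ad = 1.0 / (1.0 + np.exp(-logit))           # probability assigned to the AD class
loss = binary_cross_entropy(logit, label=1)      # label 1 = AD in this sketch
```

During training one would pass a random generator and a non-zero `drop_rate` to `mlp_logit`; at evaluation time dropout is switched off, as in the usage above.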
The method design separates the roles of adaptive multimodal selection via MoE and enhancement of low-level acoustic features through Wav2Vec2. MoE targets the most discriminative features across spectral, temporal, and semantic dimensions, while Wav2Vec2 enriches the model’s capability to process phonetic irregularities and subtle speech characteristics.
These enhancements are aligned with clinical observations of Alzheimer’s-related speech impairments, which include semantic disorganization and phonetic irregularities such as pauses and stuttering. The fusion of MoE-guided features with Wav2Vec2 phonetic embeddings provides a robust, hierarchy-aware representation, enhancing the detection model’s expressiveness. We will demonstrate the effectiveness of this integrated framework with evaluations on the ADReSSo benchmark in subsequent sections.
4. Experiments
This section outlines the experimental setup used to assess the performance of the proposed framework. We first introduce the datasets and preprocessing steps, followed by the implementation details of model training and evaluation.
4.1. Datasets and Data Preprocessing
The dataset used in this study originates from the ADReSSo Challenge (Luz et al., 2021), which comprises English speech recordings of participants describing the “Cookie Theft” picture from the Boston Diagnostic Aphasia Exam (Goodglass et al., 2001). Participants are categorized into two groups: CN and Probable AD. The original training set consists of 166 participants, while the test set includes 71 participants, with both sets balanced in terms of gender, age, and diagnostic category. Additionally, the recordings contain speech from experimenters providing instructions or engaging in brief conversations.
The original audio signals were sampled at 16 kHz. During preprocessing, sentence-level timestamp annotations provided in the dataset were used to extract and segment the speech data. To ensure speaker consistency and semantic clarity, only the participant’s utterances were retained. For segments shorter than 5 seconds, the original content was preserved and zero-padded to meet the target duration. Silent or invalid segments, including those with ASR failures, were excluded to maintain data quality.
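The zero-padding step can be illustrated with a short NumPy sketch. The 16 kHz sampling rate and 5-second target come from the text; the function name `pad_segment` is hypothetical, and leaving longer segments untouched is an assumption, since the text does not say how they are handled.

```python
import numpy as np

SAMPLE_RATE = 16_000     # Hz, as stated for the original recordings
TARGET_SECONDS = 5.0     # target duration for short segments


def pad_segment(
    wave: np.ndarray,
    sample_rate: int = SAMPLE_RATE,
    target_seconds: float = TARGET_SECONDS,
) -> np.ndarray:
    """Zero-pad a mono waveform at the end so it lasts at least `target_seconds`."""
    target_len = int(round(sample_rate * target_seconds))
    if len(wave) >= target_len:
        return wave  # assumption: segments at or above the target are left as-is
    return np.pad(wave, (0, target_len - len(wave)))  # constant zero padding


# A one-second segment becomes five seconds, with the original content kept first.
short = np.ones(SAMPLE_RATE, dtype=np.float32)
padded = pad_segment(short)
```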
All transcriptions generated by Whisper ASR were further cleaned to improve consistency and alignment across samples. This cleaning process included converting all characters to lowercase, correcting spelling errors, and filtering out non-linguistic symbols. For each cleaned speech segment, both acoustic and textual features were extracted separately. The resulting segments preserve linguistic coherence while maintaining the original acoustic structure, providing high-quality inputs for subsequent multimodal analysis.
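A minimal sketch of the transcript cleaning might look like the following. It covers lowercasing and symbol filtering only; spelling correction is omitted because the text does not say which tool was used. Treating everything other than letters, digits, apostrophes, and spaces as a non-linguistic symbol is an assumption, and `clean_transcript` is a hypothetical name.

```python
import re


def clean_transcript(text: str) -> str:
    """Lowercase a transcript and strip symbols outside a small allowed set."""
    text = text.lower()
    # Assumption: keep letters, digits, apostrophes, and spaces; drop the rest.
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    # Collapse the whitespace left behind by removed symbols.
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_transcript("The BOY is taking a cookie... uh, #@!")
```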
4.2. Implementation Details
To address the limited sample size of the ADReSSo training set, we employed a speaker-consistent TTS data augmentation strategy using the Fish-Speech toolkit. Specifically, for each subject, we synthesized new speech samples by reusing transcripts from other participants while preserving the original speaker’s acoustic characteristics. This approach maintains the speaker’s vocal identity while introducing semantic and lexical diversity. The augmented samples were generated proportionally to the original class distribution (AD vs. CN), thereby preserving label balance. Ultimately, the training set was expanded to approximately three times its original size. To evaluate the impact of different augmentation scales on model performance, we conducted comparative experiments using training sets expanded by factors of 1.5, 2, and 2.5. The optimal augmentation ratio was selected based on validation performance. A comparison of sample sizes before and after augmentation is shown in Table 1. In these ablation studies, each setting retained the original training data and supplemented the remaining portion with newly generated augmented samples as needed.
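The expansion ratio and class balance can be checked directly from the counts in Table 1. The short calculation below reads the Train_aug column as the size of the expanded training set, which is an assumption about how that column is defined.

```python
# Counts copied from Table 1.
train = {"AD": 87, "CN": 79}
train_aug = {"AD": 253, "CN": 228}

total_before = sum(train.values())
total_after = sum(train_aug.values())

overall_factor = total_after / total_before
per_class_factor = {c: train_aug[c] / train[c] for c in train}

# Share of AD samples before and after augmentation (label balance check).
ad_share_before = train["AD"] / total_before
ad_share_after = train_aug["AD"] / total_after

print(f"overall expansion: {overall_factor:.2f}x")
for c, factor in per_class_factor.items():
    print(f"{c} expansion: {factor:.2f}x")
print(f"AD share: {ad_share_before:.3f} -> {ad_share_after:.3f}")
```

Under that reading, both classes grow by a factor of roughly 2.9 and the AD share stays close to 52%, in line with the statements about an approximately threefold expansion and preserved label balance.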
To capture acoustic characteristics at different levels, we extracted three types of acoustic features. MFCCs were computed using the Librosa library with 40 Mel filter banks, a 25 ms frame length, and a 10 ms hop size. The resulting MFCC sequences (13-dimensional per frame) were fed into a two-layer bidirectional LSTM (hidden size 128), and the final hidden state was passed through a fully connected layer to obtain a fixed-length embedding. Mel spectrograms were resized to a fixed input size and processed by a pretrained ResNet18 with ImageNet weights; segment-level features were aggregated using mean pooling. Phoneme-level features were obtained by averaging the last hidden states of a wav2vec2-base-960h model.
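The framing parameters above translate into sample counts as follows. The sketch converts the 25 ms frame and 10 ms hop to samples at 16 kHz and counts frames for a 5-second segment, assuming centered framing (librosa's default), where the frame count is 1 + n_samples // hop_length. The commented librosa call is one plausible way to request these settings, not a confirmed reproduction of the authors' configuration.

```python
SAMPLE_RATE = 16_000  # Hz


def ms_to_samples(milliseconds: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(sample_rate * milliseconds / 1000))


def num_frames(n_samples: int, hop_length: int) -> int:
    """Frame count under centered framing (signal padded on both sides)."""
    return 1 + n_samples // hop_length


n_fft = ms_to_samples(25)        # 400 samples per frame
hop_length = ms_to_samples(10)   # 160 samples between frame starts
frames_5s = num_frames(5 * SAMPLE_RATE, hop_length)  # 501 frames

# Shape of the MFCC sequence fed to the BiLSTM for one padded 5-second segment:
# 13 coefficients per frame, one column per frame.
mfcc_shape = (13, frames_5s)

# Assumed librosa usage with these values (not executed here):
# librosa.feature.mfcc(y=y, sr=16000, n_mfcc=13, n_mels=40,
#                      n_fft=400, hop_length=160)
```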
Textual features were derived from Whisper ASR transcripts. After preprocessing, each sentence was encoded using a pretrained BERT-base model, and the [CLS] token embedding was used as the sentence-level representation. All extracted features were stored for downstream alignment, fusion, and classification tasks.
For feature selection, we adopted a MoE mechanism, where each feature (MFCC, spectrogram, and text) was associated with three expert networks. A shared gating mechanism dynamically assigned weights to these experts based on the input. This framework enables the model to emphasize the most discriminative features and suppress redundant information, thereby improving performance and interpretability. Notably, Wav2Vec2 features were excluded from the MoE mechanism, as they already provide high-quality phonetic representations through self-supervised pretraining and were directly incorporated in the final fusion stage.
All model components were implemented using the PyTorch framework. Training was performed using the Adam optimizer with an initial learning rate of 0.0067 and a batch size of 32. Binary cross-entropy was used as the loss function. To ensure result reliability, each experiment was repeated five times with fixed random seeds, and the final performance metrics reported represent the average across all five runs.
5. Results and Analysis
This section presents the experimental results of our MoTAS framework for speech-based Alzheimer’s early screening, including comparisons with previous studies and an ablation study to evaluate the impact of key components.
5.1. Comparison with Previous Studies
We compared our proposed MoTAS framework with a range of existing speech-based AD detection models, including both acoustic-only approaches (Luz et al., 2021; Pan et al., 2021; Li and Zhang, 2024) and multimodal methods that combine speech with ASR-transcribed text (Luz et al., 2021; Zhu et al., 2021; Wang et al., 2021; Pan et al., 2021; Li and Zhang, 2024). The comparative results are summarized in Table 2.
As shown in the table, the proposed MoTAS framework achieves the highest overall classification accuracy (85.71%) on the ADReSSo test set, outperforming all baselines from both single- and multi-modal categories. It also obtains the best CN precision (93.10%), AD recall (94.29%) and AD F1-score (86.84%), indicating strong sensitivity and balanced detection performance. These results demonstrate the effectiveness of our design in capturing AD-related speech and language patterns with greater precision and robustness.
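For readers less familiar with the per-class metrics quoted above, the sketch below shows how precision, recall, and F1 follow from confusion counts. The counts are hypothetical, chosen only to match the 35 AD / 36 CN test split in Table 1. They are not the paper's confusion matrix, since the reported figures are averages over five runs.

```python
def per_class_metrics(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical single-run confusion counts on a 35 AD / 36 CN test split.
ad_as_ad, ad_as_cn = 33, 2  # AD speakers predicted as AD / as CN
cn_as_cn, cn_as_ad = 28, 8  # CN speakers predicted as CN / as AD

# A false positive for one class is a misclassified sample of the other class.
ad = per_class_metrics(tp=ad_as_ad, fp=cn_as_ad, fn=ad_as_cn)
cn = per_class_metrics(tp=cn_as_cn, fp=ad_as_cn, fn=cn_as_ad)
accuracy = (ad_as_ad + cn_as_cn) / (ad_as_ad + ad_as_cn + cn_as_cn + cn_as_ad)

for name, (p, r, f1) in (("AD", ad), ("CN", cn)):
    print(f"{name}: precision={p:.2%} recall={r:.2%} f1={f1:.2%}")
print(f"accuracy={accuracy:.2%}")
```

The example also makes the screening trade-off concrete: only two AD speakers are missed (high AD recall), at the cost of eight CN speakers being flagged for follow-up.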
Compared to acoustic-only models such as Wav2Vec2+TB (Pan et al., 2021) and Whisper-TL medium (Li and Zhang, 2024), which achieve AD recall rates of 68.57% and 77.14% respectively, our framework shows substantial improvements. For example, MoTAS raises AD recall by over 17 percentage points compared with Whisper-TL medium (94.29% vs. 77.14%), while also improving accuracy and F1-score. These gains are likely attributable to the combined advantages of multimodal input, expert diversity, and MoE-guided adaptive feature selection, rather than to data augmentation alone.
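The recall gain over Whisper-TL medium can be read either as an absolute difference in percentage points or as a relative improvement. The short sketch below computes both from the two AD recall values quoted above; the function names are illustrative.

```python
def absolute_gain_pp(new_pct: float, old_pct: float) -> float:
    """Difference between two percentages, in percentage points."""
    return new_pct - old_pct


def relative_gain_pct(new_pct: float, old_pct: float) -> float:
    """Improvement expressed as a percentage of the old value."""
    return 100.0 * (new_pct - old_pct) / old_pct


motas_recall, whisper_tl_recall = 94.29, 77.14  # AD recall values quoted above
gain_pp = absolute_gain_pp(motas_recall, whisper_tl_recall)
gain_rel = relative_gain_pct(motas_recall, whisper_tl_recall)
print(f"{gain_pp:.2f} pp absolute, {gain_rel:.1f}% relative")
```

The "over 17" figure corresponds to the absolute difference; the relative improvement is somewhat larger.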
Among state-of-the-art multimodal systems, including WavBERT Mp (Zhu et al., 2021), TDNN-ASR-M5 (Pan et al., 2021), and Whisper-TL-FTP (Li and Zhang, 2024), our method remains the top-performing model. Although WavBERT Mp achieves a strong AD precision of 87.10% and CN recall of 88.89%, MoTAS outperforms it across several key metrics, including AD F1-score (86.84% vs. 81.82%), AD recall (94.29% vs. 77.14%), and accuracy (85.71% vs. 83.10%). These results reflect the complementary benefits of TTS data augmentation and adaptive expert selection enabled by the MoE mechanism.
Notably, several multimodal baselines exhibit imbalanced per-class performance. For example, WavBERT Mb (Zhu et al., 2021) achieves only 68.57% AD recall, while C-Attention-Unified (Wang et al., 2021) shows a strong bias toward AD classification, achieving a recall of 87.22% for AD and 68.57% for CN. These outcomes suggest that naive modality fusion without adaptive control can lead to redundancy or modal dominance. In contrast, the MoE gating mechanism in our framework selectively emphasizes the most informative features for each input, improving both classification balance and model interpretability.
MoTAS also demonstrates robustness to ASR errors, which are common in spontaneous and cognitively impaired speech. The MoE gating mechanism effectively down-weights unreliable textual features, mitigating their impact on final predictions. Importantly, the high AD recall (94.29%) and F1-score (86.84%) are especially valuable in clinical screening scenarios, where reducing false negatives is critical for early diagnosis and intervention. By maintaining high sensitivity without compromising precision or overall accuracy, our model helps mitigate underdiagnosis risks.
In summary, the proposed MoTAS framework combines multimodal inputs, TTS-augmented speech, and MoE-guided adaptive feature selection to achieve balanced and superior performance across both AD and CN classes, demonstrating strong potential for real-world deployment in early-stage AD screening based on spontaneous speech.
5.2. Ablation Study
To evaluate the independent and synergistic contributions of the TTS-augmented speech data and the MoE-guided feature selection mechanism in our framework, we conducted a comprehensive ablation study. As shown in Table 3, the MoE mechanism was evaluated under two conditions: without data augmentation (Experiment IDs 1 and 2) and with 2× TTS augmentation, which yielded the best performance (Experiment IDs 3 and 4). The TTS augmentation was further analyzed by comparing multiple augmentation factors, including none, 1.5×, 2×, 2.5×, and 3× (Experiment IDs 2, 5, 4, 6, and 7, respectively).
The results demonstrate that the MoE mechanism consistently enhances performance across settings. Without TTS augmentation, introducing MoE increased the test accuracy from 78.28% to 79.71% (ID 1 vs. ID 2), indicating its effectiveness under limited data conditions. When applied to the 2× augmented dataset, MoE further improved accuracy from 81.72% to 85.71% (ID 3 vs. ID 4), a gain of 3.99 percentage points. In addition to accuracy, other metrics such as precision, recall, and F1-score also improved, confirming MoE’s role in boosting robustness and discriminative capability. By dynamically weighting the importance of multimodal features, the MoE mechanism effectively reduces redundancy and enhances the model’s ability to capture AD-relevant acoustic and linguistic characteristics.
For the TTS augmentation, Figure 3 illustrates the influence of varying augmentation levels on test accuracy. As the augmentation factor increased from none to 2×, the accuracy steadily improved, peaking at 85.71%. However, further augmentation to 2.5× and 3× led to a decline in accuracy to 82.86% and 80.29%, respectively. This performance degradation may be attributed to the reduced proportion of real samples, which leads the model to overfit to the synthetic distribution and impairs its generalization ability.
Therefore, effective data augmentation should not only focus on increasing quantity but also ensure the quality of synthetic data. Moderate augmentation improves sample diversity and mitigates overfitting, while excessive augmentation can negatively impact performance. Based on these findings, we selected 2× TTS augmentation as the optimal configuration, balancing dataset richness with training stability.
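The trade-off described above can be made concrete with a small sketch that tabulates the accuracies quoted in the text and picks the best factor. The `real_fraction` helper rests on an assumption that is not confirmed by the text, namely that a factor of k adds k synthetic clips per real clip, and is included only to illustrate how the share of real data shrinks as the factor grows.

```python
# Test accuracies (%) quoted in the text; the 1.5x value is not quoted and is omitted.
# A factor of 0.0 denotes the no-augmentation setting.
accuracy_by_factor = {0.0: 79.71, 2.0: 85.71, 2.5: 82.86, 3.0: 80.29}


def real_fraction(factor: float) -> float:
    """Share of real samples, ASSUMING factor k adds k synthetic clips per real clip."""
    return 1.0 / (1.0 + factor)


best_factor = max(accuracy_by_factor, key=accuracy_by_factor.get)
for factor, acc in sorted(accuracy_by_factor.items()):
    print(f"factor={factor:<4} accuracy={acc:.2f}% real share={real_fraction(factor):.0%}")
print("best factor:", best_factor)
```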
In summary, the ablation study validates the complementary strengths of our MoTAS framework. TTS enhances data diversity and generalization, while MoE improves the selection of discriminative features. The integration of both significantly boosts classification accuracy and robustness, providing a solid foundation for scalable and effective speech-based AD screening.
6. Conclusion
This study proposes MoTAS, a framework that combines TTS-augmented speech data with MoE-guided feature selection to improve speech-based early screening for AD. By expanding the training set with synthetic speech and adaptively selecting multimodal features, the proposed approach addresses key challenges such as data scarcity, feature redundancy, and model overfitting.
Experiments on the ADReSSo dataset demonstrate that our method significantly outperforms existing speech-based models in both accuracy and robustness. The results confirm the synergistic effect of TTS augmentation and MoE-guided feature selection, which enhances model generalization while optimizing multimodal fusion under constrained computational resources.
This framework offers a flexible foundation for developing more efficient cognitive screening systems. Future work will explore its applicability to cross-lingual and cross-dataset scenarios, as well as further optimize computational efficiency to enable real-time clinical deployment. Overall, our findings highlight the potential of synthetic data generation and adaptive feature fusion in advancing early Alzheimer’s screening toward more efficient, reliable, and scalable solutions.
References
- Baevski et al. (2020) Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems 33 (2020), 12449–12460.
- Balagopalan and Novikova (2021) Aparna Balagopalan and Jekaterina Novikova. 2021. Comparing acoustic-based approaches for Alzheimer’s disease detection. arXiv preprint arXiv:2106.01555 (2021).
- Bharati et al. (2022) Subrato Bharati, Prajoy Podder, Dang Ngoc Hoang Thanh, and VB Surya Prasath. 2022. Dementia classification using MR imaging and clinical data with voting based machine learning models. Multimedia Tools and Applications 81, 18 (2022), 25971–25992.
- Chen et al. (2021) Jun Chen, Jieping Ye, Fengyi Tang, and Jiayu Zhou. 2021. Automatic detection of Alzheimer’s disease using spontaneous speech only. In Interspeech, Vol. 2021. 3830.
- Cui et al. (2021) Xia Cui, Amila Gamage, Terry Hanley, and Tingting Mu. 2021. Identifying indicators of vulnerability from short speech segments using acoustic and textual features. Proceedings of Interspeech 2021 (2021), 1569–1573.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). 4171–4186.
- Ding et al. (2024) Kewen Ding, Madhu Chetty, Azadeh Noori Hoshyar, Tanusri Bhattacharya, and Britt Klein. 2024. Speech based detection of Alzheimer’s disease: a survey of AI techniques, datasets and challenges. Artificial Intelligence Review 57, 12 (2024), 325.
