CogniAlign: Word-Level Multimodal Speech Alignment with Gated Cross-Attention for Alzheimer's Detection

David Ortiz-Perez; Manuel Benavent-Lledo; Javier Rodriguez-Juan; Jose Garcia-Rodriguez; David Tomás

arXiv:2506.01890 · cs.LG · October 27, 2025

TL;DR

CogniAlign aligns audio embeddings with text tokens at the word level and fuses the two modalities with gated cross-attention and prosodic cues, reaching 87.35% accuracy on the ADReSSo dataset for Alzheimer's detection.

Contribution

This work presents a novel word-level alignment and fusion mechanism for multimodal speech analysis, enhancing early Alzheimer's detection beyond existing methods.

Findings

01. Achieved 87.35% accuracy on the ADReSSo dataset

02. Outperformed state-of-the-art methods in Alzheimer's detection

03. Demonstrated the effectiveness of prosodic cues and attention-based fusion

Abstract

Early detection of cognitive disorders such as Alzheimer's disease is critical for enabling timely clinical intervention and improving patient outcomes. In this work, we introduce CogniAlign, a multimodal architecture for Alzheimer's detection that integrates audio and textual modalities, two non-intrusive sources of information that offer complementary insights into cognitive health. Unlike prior approaches that fuse modalities at a coarse level, CogniAlign leverages a word-level temporal alignment strategy that synchronizes audio embeddings with corresponding textual tokens based on transcription timestamps. This alignment supports the development of token-level fusion techniques, enabling more precise cross-modal interactions. To fully exploit this alignment, we propose a Gated Cross-Attention Fusion mechanism, where audio features attend over textual representations, guided by the…
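The word-level alignment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `align_audio_to_words`, the 20 ms frame hop, the millisecond timestamps, and mean pooling over each word span are all assumptions made for illustration. The idea is only to show how per-frame audio embeddings could be grouped by transcription timestamps so that each textual token gets one audio vector.

```python
from typing import List, Tuple

Vector = List[float]


def align_audio_to_words(
    frames: List[Vector],
    word_spans: List[Tuple[str, int, int]],
    frame_hop_ms: int = 20,  # assumed hop between audio frames
) -> List[Tuple[str, Vector]]:
    """Mean-pool audio frame embeddings over each word's time span.

    word_spans holds (word, start_ms, end_ms) triples, e.g. from an ASR
    system that emits word-level timestamps (hypothetical input format).
    """
    if not frames:
        raise ValueError("frames must be non-empty")
    dim = len(frames[0])
    aligned = []
    for word, start_ms, end_ms in word_spans:
        # Convert timestamps to frame indices, clamped to the valid range.
        first = min(max(start_ms // frame_hop_ms, 0), len(frames) - 1)
        # Guarantee at least one frame per word, even for very short words.
        last = min(max(end_ms // frame_hop_ms, first + 1), len(frames))
        span = frames[first:last]
        pooled = [sum(f[d] for f in span) / len(span) for d in range(dim)]
        aligned.append((word, pooled))
    return aligned


# Toy example: four 2-dimensional frames, two words of 40 ms each.
frames = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0], [6.0, 12.0]]
words = [("the", 0, 40), ("boy", 40, 80)]
aligned = align_audio_to_words(frames, words)
print(aligned)  # [('the', [1.0, 2.0]), ('boy', [5.0, 10.0])]
```

The output has one audio vector per word, so it can be paired index by index with text token embeddings before any token-level fusion. Mean pooling is just one plausible choice here; the abstract does not say how frames within a word are combined.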

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and dialogue systems