CBF-AFA: Chunk-Based Multi-SSL Fusion for Automatic Fluency Assessment
Papa Séga Wade, Mihai Andries, Ioannis Kanellos, Thierry Moudenc

TL;DR
This paper presents a chunk-based approach to automatic fluency assessment that fuses embeddings from several self-supervised learning (SSL) speech models, adds chunk-level fluency markers, and scores the result with a hierarchical CNN-BiLSTM network.
Contribution
It introduces a novel chunk-based fusion method combining multiple SSL models and hierarchical neural networks for enhanced fluency evaluation.
Findings
Improves F1-score by 2.8 points on Speechocean762
Increases Pearson correlation by 6.2 points on Speechocean762
Surpasses existing segmentation baselines in fluency assessment
Abstract
Automatic fluency assessment (AFA) remains challenging, particularly in capturing speech rhythm, pauses, and disfluencies in non-native speakers. We introduce a chunk-based approach integrating self-supervised learning (SSL) models (Wav2Vec2, HuBERT, and WavLM) selected for their complementary strengths in phonetic, prosodic, and noisy speech modeling, with a hierarchical CNN-BiLSTM framework. Speech is segmented into breath-group chunks using Silero voice activity detection (Silero-VAD), enabling fine-grained temporal analysis while mitigating over-segmentation artifacts. SSL embeddings are fused via a learnable weighted mechanism, balancing acoustic and linguistic features, and enriched with chunk-level fluency markers (e.g., speech rate, pause durations, n-gram repetitions). The CNN-BiLSTM captures local and long-term dependencies across chunks. Evaluated on Avalinguo and…
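The fusion and chunk-level markers described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the learnable weighting is one scalar per SSL model normalised with a softmax, that speech rate is counted in words per second, and that repetitions are counted over word bigrams. The names `fuse_embeddings` and `chunk_markers` are hypothetical, and the weights are fixed numbers here rather than trained parameters.

```python
import math
from typing import Dict, List, Optional, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scalars."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_embeddings(embeddings: Sequence[Sequence[float]],
                    logits: Sequence[float]) -> List[float]:
    """Weighted sum of same-sized embeddings, one scalar weight per model.

    Assumption: `logits` stands in for the learnable parameters. During
    training they would be updated by gradient descent; here they are fixed.
    """
    if len(embeddings) != len(logits):
        raise ValueError("need exactly one logit per embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must share one dimension")
    weights = softmax(logits)
    return [sum(w * e[i] for w, e in zip(weights, embeddings))
            for i in range(dim)]


def chunk_markers(words: Sequence[str], start: float, end: float,
                  prev_end: Optional[float] = None,
                  n: int = 2) -> Dict[str, float]:
    """Simple fluency markers for one breath-group chunk.

    Assumptions: times are in seconds, speech rate is words per second,
    and a repetition is any n-gram that occurs more than once in the chunk.
    """
    duration = end - start
    if duration <= 0:
        raise ValueError("chunk must have positive duration")
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    repeats = len(ngrams) - len(set(ngrams))
    pause = 0.0 if prev_end is None else max(0.0, start - prev_end)
    return {
        "speech_rate": len(words) / duration,
        "pause_before": pause,
        "ngram_repeats": float(repeats),
    }
```

In a full system the fused vector and the marker values for each chunk would be concatenated and passed to the sequence model; how the paper combines them in detail is not stated in the text above.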
Taxonomy
Topics: Software System Performance and Reliability
