JCAPT: A Joint Modeling Approach for CAPT
Tzu-Hsuan Yang, Yue-Yang He, and Berlin Chen

TL;DR
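This paper introduces JCAPT, a joint modeling framework for computer-assisted pronunciation training (CAPT) that combines phonological features, a Mamba-based state space model, and think-token prompting to improve both automatic pronunciation assessment and mispronunciation detection and diagnosis. It outperforms prior methods on the speechocean762 benchmark.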
Contribution
To the authors' knowledge, this is the first work to integrate phonological attribution, SSM-based modeling, and prompting in a unified CAPT framework, with the aim of improving interpretability and fine-grained temporal reasoning.
Findings
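JCAPT consistently outperforms prior methods on the speechocean762 benchmark.
The gains are most pronounced on the mispronunciation detection and diagnosis (MDD) task.
Phonological features and think tokens are integrated to enhance interpretability and fine-grained temporal reasoning in pronunciation assessment.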
Abstract
Effective pronunciation feedback is critical in second language (L2) learning, for which computer-assisted pronunciation training (CAPT) systems often encompass two key tasks: automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD). Recent work has shown that joint modeling of these two tasks can yield mutual benefits. Our unified framework leverages Mamba, a selective state space model (SSM), while integrating phonological features and think token strategies to jointly enhance interpretability and fine-grained temporal reasoning in APA and MDD. To our knowledge, this is the first study to combine phonological attribution, SSM-based modeling, and prompting in CAPT. A series of experiments conducted on the speechocean762 benchmark demonstrates that our model consistently outperforms prior methods, particularly on the MDD task.
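For readers unfamiliar with selective state space models, the recurrence behind Mamba-style layers can be sketched in a few lines. This is a minimal, single-channel illustration and not the JCAPT implementation: the function name `selective_scan` and the weights `w_delta`, `b_delta`, `w_B`, and `w_C` are hypothetical, and the input matrix uses a simplified Euler discretization. The point it shows is that the step size and the input and output matrices depend on the current input, which is what makes the model "selective".

```python
import math


def softplus(z):
    # Numerically stable log(1 + exp(z)); keeps the step size positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(xs, A, w_delta, b_delta, w_B, w_C):
    """Run a single-channel selective SSM over the scalar sequence xs.

    A: diagonal state-matrix entries (negative values give a decaying state).
    w_delta, b_delta, w_B, w_C: hypothetical projection weights that make the
    step size, input matrix, and output matrix functions of the input.
    """
    n = len(A)
    h = [0.0] * n          # hidden state, one entry per state dimension
    ys = []
    for x in xs:
        delta = softplus(w_delta * x + b_delta)   # input-dependent step size
        y = 0.0
        for i in range(n):
            a_bar = math.exp(delta * A[i])        # zero-order-hold discretization of A
            b_bar = delta * (w_B[i] * x)          # simplified discretization of B_t
            h[i] = a_bar * h[i] + b_bar * x       # state update
            y += (w_C[i] * x) * h[i]              # input-dependent readout C_t . h
        ys.append(y)
    return ys


if __name__ == "__main__":
    print(selective_scan([1.0, 1.0], [-1.0], 0.0, 0.0, [1.0], [1.0]))
```

The loop runs in time linear in the sequence length, which is the property that makes this family of models attractive for frame-level or phone-level sequences such as those in APA and MDD.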
Taxonomy
Topics: Business Process Modeling and Analysis
Methods: Adaptive Pseudo Augmentation · Mamba: Linear-Time Sequence Modeling with Selective State Spaces
