Text-Conditioned Transformer for Automatic Pronunciation Error Detection
Zhan Zhang, Yuehai Wang, Jianyi Yang

TL;DR
This paper introduces a text-conditioned Transformer model for automatic pronunciation error detection that leverages target text as a condition, enabling end-to-end error detection with improved accuracy and faster inference.
Contribution
The paper proposes a Transformer-based approach that takes the target text as a condition, so that prior knowledge of what the speaker was supposed to say is used during detection itself rather than only in a separate alignment step.
Findings
Achieves an 8.4% relative improvement in F1 score on the L2-Arctic dataset.
The method enables faster inference by operating in a feed-forward manner.
Outperforms baseline ASR-based APED models.
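The first finding is a relative gain, not an absolute one. The snippet below is a minimal sketch of that arithmetic. The helper names and the baseline value are hypothetical and are not taken from the paper; only the 8.4% figure comes from the summary above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_improvement(baseline, improved):
    """Gain expressed as a fraction of the baseline score."""
    return (improved - baseline) / baseline


# Hypothetical baseline F1, chosen only to make the arithmetic visible.
baseline_f1 = 0.500
# An 8.4% relative improvement multiplies the baseline by 1.084.
improved_f1 = baseline_f1 * 1.084

print(f"absolute gain: {improved_f1 - baseline_f1:.3f}")
print(f"relative gain: {relative_improvement(baseline_f1, improved_f1):.1%}")
```

With this made-up baseline, the absolute change in F1 is much smaller than 8.4 points, which is why the distinction matters when comparing reported results.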
Abstract
Automatic pronunciation error detection (APED) plays an important role in language learning. In previous ASR-based APED methods, the decoded results must be aligned with the target text to locate the errors. However, because the decoding and alignment processes are independent, prior knowledge about the target text is not fully utilized. In this paper, we propose to use the target text as an extra condition for the Transformer backbone to handle the APED task. The proposed method outputs the error states by considering the relationship between the input speech and the target text in a fully end-to-end fashion. Meanwhile, because the prior target text is used as a condition for the decoder input, the Transformer works in a feed-forward rather than autoregressive manner at inference, which can significantly boost the speed…
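To make the baseline concrete, the following is a minimal sketch of the decode-then-align step that the abstract attributes to earlier ASR-based methods. It is not the proposed model. It assumes a plain Levenshtein alignment between phoneme sequences, which may differ from the aligner used in the paper, and the function name, state labels, and phoneme strings are hypothetical.

```python
def align_error_states(target, decoded):
    """Label each target phoneme by aligning it with the decoded phonemes.

    Returns (states, insertions): one state per target phoneme, plus the
    number of decoded phonemes that match no target phoneme.
    """
    n, m = len(target), len(decoded)
    # dist[i][j] = edit distance between target[:i] and decoded[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (target[i - 1] != decoded[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Backtrace from the bottom-right corner to recover one optimal alignment.
    states = []
    insertions = 0
    i, j = n, m
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and target[i - 1] != decoded[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + mismatch:
            states.append("substituted" if mismatch else "correct")
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            states.append("deleted")  # target phoneme was never decoded
            i -= 1
        else:
            insertions += 1  # extra decoded phoneme with no target match
            j -= 1
    states.reverse()
    return states, insertions


# Hypothetical example: the learner says "D IH S" for the target "DH IH S".
states, extra = align_error_states(["DH", "IH", "S"], ["D", "IH", "S"])
print(states, extra)
```

In this pipeline the recognizer never sees the target text; it only enters at the alignment stage. The approach described in the abstract instead feeds the target text to the model as a condition, so the error states come out of a single pass with no separate alignment.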
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Layer Normalization · Label Smoothing · Byte Pair Encoding · Residual Connection
