Oracle Teacher: Leveraging Target Information for Better Knowledge Distillation of CTC Models
Ji Won Yoon, Hyung Yong Kim, Hyeonseung Lee, Sunghwan Ahn, and Nam Soo Kim

TL;DR
This paper introduces the Oracle Teacher, a novel CTC-based teacher model that uses both source inputs and output labels to improve knowledge distillation, leading to better student models and faster training.
Contribution
The paper proposes a new Oracle Teacher model for CTC sequence models that leverages target information, with a training strategy to prevent trivial copying and enhance distillation effectiveness.
Findings
Improves student model performance in speech and text recognition tasks.
Reduces training time for teacher models significantly.
Demonstrates effectiveness of target-aware distillation in sequence models.
Abstract
Knowledge distillation (KD), best known as an effective method for model compression, aims at transferring the knowledge of a bigger network (teacher) to a much smaller network (student). Conventional KD methods usually employ the teacher model trained in a supervised manner, where output labels are treated only as targets. Extending this supervised scheme further, we introduce a new type of teacher model for connectionist temporal classification (CTC)-based sequence models, namely Oracle Teacher, that leverages both the source inputs and the output labels as the teacher model's input. Since the Oracle Teacher learns a more accurate CTC alignment by referring to the target information, it can provide the student with more optimal guidance. One potential risk for the proposed approach is a trivial solution that the model's output directly copies the target input. Based on a many-to-one…
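Two ideas from the abstract can be illustrated with a minimal sketch: the many-to-one nature of CTC (many frame-level alignments collapse to the same label sequence) and frame-level distillation (the student is pushed toward the teacher's per-frame posteriors). This is a generic illustration under stated assumptions, not the paper's exact objective or training strategy. The blank symbol `_` and the function names `ctc_collapse` and `framewise_kl` are hypothetical and chosen only for this example.

```python
import math

BLANK = "_"  # hypothetical blank symbol used only in this sketch


def ctc_collapse(alignment, blank=BLANK):
    """Map a frame-level CTC alignment to its label sequence.

    Standard CTC rule: merge consecutive repeated symbols, then drop blanks.
    """
    labels = []
    prev = None
    for sym in alignment:
        # Keep a symbol only if it starts a new run and is not the blank.
        if sym != prev and sym != blank:
            labels.append(sym)
        prev = sym
    return labels


def framewise_kl(teacher_probs, student_probs, eps=1e-12):
    """Mean over frames of KL(teacher || student).

    Both arguments are lists of per-frame probability distributions
    over the same vocabulary (including the blank).
    """
    total = 0.0
    for t_frame, s_frame in zip(teacher_probs, student_probs):
        total += sum(
            p * math.log((p + eps) / (q + eps))
            for p, q in zip(t_frame, s_frame)
            if p > 0.0
        )
    return total / len(teacher_probs)


# Two different alignments that collapse to the same labels.
a1 = ctc_collapse(list("__cc_aa_t_"))
a2 = ctc_collapse(list("c_aat"))

# A toy teacher and a less confident student over a 3-symbol vocabulary.
teacher = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]
student = [[0.6, 0.2, 0.2], [0.3, 0.4, 0.3]]
kd_loss = framewise_kl(teacher, student)
```

Because several alignments are valid for one target, the quality of the alignment the teacher settles on determines how useful its per-frame posteriors are as a distillation signal, which is the motivation the abstract gives for letting the teacher see the target.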
Taxonomy
Topics: Speech Recognition and Synthesis · Neural Networks and Applications · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Residual Connection · Dense Connections · Absolute Position Encodings · Byte Pair Encoding · Softmax · Position-Wise Feed-Forward Layer
