Oracle Teacher: Leveraging Target Information for Better Knowledge Distillation of CTC Models

Ji Won Yoon, Hyung Yong Kim, Hyeonseung Lee, Sunghwan Ahn, and Nam Soo Kim

arXiv:2111.03664 · cs.LG · August 14, 2023

TL;DR

This paper introduces the Oracle Teacher, a novel CTC-based teacher model that uses both source inputs and output labels to improve knowledge distillation, leading to better student models and faster training.

Contribution

The paper proposes a new Oracle Teacher model for CTC sequence models that leverages target information, with a training strategy to prevent trivial copying and enhance distillation effectiveness.

Findings

01. Improves student model performance in speech and text recognition tasks.

02. Reduces training time for teacher models significantly.

03. Demonstrates effectiveness of target-aware distillation in sequence models.

Abstract

Knowledge distillation (KD), best known as an effective method for model compression, aims at transferring the knowledge of a bigger network (teacher) to a much smaller network (student). Conventional KD methods usually employ the teacher model trained in a supervised manner, where output labels are treated only as targets. Extending this supervised scheme further, we introduce a new type of teacher model for connectionist temporal classification (CTC)-based sequence models, namely Oracle Teacher, that leverages both the source inputs and the output labels as the teacher model's input. Since the Oracle Teacher learns a more accurate CTC alignment by referring to the target information, it can provide the student with more optimal guidance. One potential risk for the proposed approach is a trivial solution that the model's output directly copies the target input. Based on a many-to-one…
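To make the term "CTC alignment" concrete, the following is a minimal sketch of the standard CTC collapsing rule, which maps a frame-level alignment to its label sequence by merging repeated symbols and then removing blanks. It illustrates CTC in general, not the Oracle Teacher itself; the function name `ctc_collapse`, the blank symbol `-`, and the toy alignments are assumptions made for this illustration.

```python
from itertools import groupby

BLANK = "-"  # hypothetical blank symbol, chosen only for this illustration


def ctc_collapse(alignment, blank=BLANK):
    """Map a frame-level CTC alignment to its label sequence."""
    # Step 1: merge each run of identical consecutive symbols into one symbol.
    merged = [symbol for symbol, _ in groupby(alignment)]
    # Step 2: drop the blank symbols that remain.
    return [symbol for symbol in merged if symbol != blank]


# Several different frame-level alignments collapse to the same label sequence.
alignments = ["cc-aa-t", "c-a-tt-", "-ca--t-"]
labels = ["".join(ctc_collapse(a)) for a in alignments]
print(labels)

# A doubled letter survives only if a blank separates its two occurrences.
print("".join(ctc_collapse("hel-lo")))
print("".join(ctc_collapse("hello")))
```

Because many alignments share one label sequence, a CTC model has to choose among them during training, and the alignment a teacher settles on is what it passes to the student as frame-level guidance.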

Taxonomy

Topics: Speech Recognition and Synthesis · Neural Networks and Applications · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Residual Connection · Dense Connections · Absolute Position Encodings · Byte Pair Encoding · Softmax · Position-Wise Feed-Forward Layer