Optimizing Estonian TV Subtitles with Semi-supervised Learning and LLMs

Artem Fedorchenko, Tanel Alumäe

arXiv:2501.05234 · cs.CL · January 10, 2025

TL;DR

This paper introduces a semi-supervised learning approach using fine-tuned Whisper models and LLM-based post-editing to generate high-quality Estonian TV subtitles, showing significant improvements in subtitle accuracy.

Contribution

The paper combines iterative pseudo-labeling with LLM-based post-editing to improve subtitle quality, a novel integration for Estonian-language content.

Findings

1. Pseudo-labeling improves subtitle quality with unlabeled data.

2. LLM-based editing at test time enhances accuracy.

3. Training-time LLM editing does not provide additional benefits.

Abstract

This paper presents an approach for generating high-quality, same-language subtitles for Estonian TV content. We fine-tune the Whisper model on human-generated Estonian subtitles and enhance it with iterative pseudo-labeling and large language model (LLM) based post-editing. Our experiments demonstrate notable subtitle quality improvement through pseudo-labeling with an unlabeled dataset. We find that applying LLM-based editing at test time enhances subtitle accuracy, while its use during training does not yield further gains. This approach holds promise for producing subtitles close to human quality and could be extended to real-time applications.
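
The abstract names two mechanisms: iterative pseudo-labeling and LLM post-editing applied at test time. The sketch below shows one way such a pipeline could be organized. It is not the authors' implementation: `train`, `edit`, the `rounds` default, and both function names are hypothetical placeholders, and the abstract does not say how many iterations were run or whether pseudo-labels were filtered.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]            # (audio_id, subtitle_text)
Model = Callable[[str], str]      # maps an audio_id to subtitle text


def iterative_pseudo_labeling(
    labeled: List[Pair],
    unlabeled_audio: List[str],
    train: Callable[[List[Pair]], Model],  # hypothetical fine-tuning routine
    rounds: int = 2,                       # assumed value, not from the paper
) -> Model:
    """Train on human subtitles, then repeatedly relabel and retrain."""
    model = train(labeled)  # initial model from human-made subtitles only
    for _ in range(rounds):
        # Transcribe the unlabeled pool with the current model.
        pseudo = [(audio, model(audio)) for audio in unlabeled_audio]
        # Retrain on the human-labeled pairs plus the pseudo-labeled pairs.
        model = train(labeled + pseudo)
    return model


def post_edit_at_test_time(model: Model, edit: Callable[[str], str], audio: str) -> str:
    """Apply an editing step (hypothetically an LLM call) to the model output only."""
    return edit(model(audio))
```

Here the editing step runs only on test-time output and never touches the pseudo-labels, which mirrors the reported finding that LLM editing during training gave no further gains.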
