From Speech to Subtitles: Evaluating ASR Models in Subtitling Italian Television Programs
Alessandro Lucca, Francesco Pierri

TL;DR
This study evaluates four state-of-the-art ASR models for Italian TV subtitling. It finds that they are useful tools but still need human oversight to reach industry-grade accuracy, and it introduces a supporting cloud infrastructure.
Contribution
The paper provides a comprehensive case study of ASR performance in real-world Italian media subtitling: it benchmarks four models and proposes a human-in-the-loop workflow.
Findings
Current ASR models do not fully meet industry accuracy standards.
Models significantly improve productivity when combined with human review.
A cloud-based infrastructure supports effective human-in-the-loop subtitling workflows.
Abstract
Subtitles are essential for video accessibility and audience engagement. Modern Automatic Speech Recognition (ASR) systems, built upon Encoder-Decoder neural network architectures and trained on massive amounts of data, have progressively reduced transcription errors on standard benchmark datasets. However, their performance in real-world production environments, particularly for non-English content like long-form Italian videos, remains largely unexplored. This paper presents a case study on developing a professional subtitling system for an Italian media company. To inform our system design, we evaluated four state-of-the-art ASR models (Whisper Large v2, AssemblyAI Universal, Parakeet TDT v3 0.6b, and WhisperX) on a 50-hour dataset of Italian television programs. The study highlights their strengths and limitations, benchmarking their performance against the work of professional…
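As an illustration of how transcription errors are commonly scored, the sketch below computes word error rate (WER) between a reference transcript and an ASR hypothesis. The abstract as shown here does not name the metric the authors used, so WER is an assumption, and the function name `word_error_rate`, the lowercase-and-split normalization, and the Italian sample sentences are all hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: text is normalized only by lowercasing and splitting on
    whitespace; a production pipeline would likely also handle punctuation.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical example: one dropped word out of five.
    print(word_error_rate("il gatto dorme sul divano", "il gatto dorme divano"))
```

Because the score is normalized by the reference length, it can exceed 1.0 when the hypothesis contains many inserted words.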
Taxonomy
Topics: Subtitles and Audiovisual Media · Speech Recognition and Synthesis · Speech and Audio Processing
