V-SAT: Video Subtitle Annotation Tool
Arpita Kundu, Joyita Chakraborty, Anindita Desarkar, Aritra Sen, Srushti Anil Patil, Vishwanathan Raman

TL;DR
V-SAT is a comprehensive, automated framework that improves subtitle quality by detecting and correcting various issues using advanced AI models, reducing manual editing and enhancing synchronization and accuracy.
Contribution
The paper introduces V-SAT, the first unified system combining LLMs, VLMs, image processing, and ASR for automatic, comprehensive subtitle correction and annotation.
Findings
SUBER score reduced from 9.6 to 3.54 (lower is better)
F1-scores of ~0.80 for image mode issues
High human-in-the-loop validation quality
Abstract
The surge of audiovisual content on streaming platforms and social media has heightened the demand for accurate and accessible subtitles. However, existing subtitle generation methods, primarily speech-based transcription or OCR-based extraction, suffer from several shortcomings, including poor synchronization, incorrect or harmful text, inconsistent formatting, inappropriate reading speeds, and the inability to adapt to dynamic audio-visual contexts. Current approaches often address isolated issues, leaving post-editing as a labor-intensive and time-consuming process. In this paper, we introduce V-SAT (Video Subtitle Annotation Tool), a unified framework that automatically detects and corrects a wide range of subtitle quality issues. By combining Large Language Models (LLMs), Vision-Language Models (VLMs), Image Processing, and Automatic Speech Recognition (ASR), V-SAT leverages…
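To make one of the issue types named in the abstract concrete, the sketch below flags subtitle cues whose reading speed is too high, measured in characters per second. It is an illustration only, not V-SAT's method: the `Cue` class, the function names, and the 17 characters-per-second threshold are all assumptions introduced here and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """One subtitle cue (hypothetical structure, not from the paper)."""
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str


def chars_per_second(cue: Cue) -> float:
    """Reading speed of a cue, counting every character except line breaks."""
    duration = cue.end - cue.start
    if duration <= 0:
        # A zero or negative duration can never be read in time.
        return float("inf")
    visible = len(cue.text.replace("\n", ""))
    return visible / duration


def flag_fast_cues(cues: list[Cue], max_cps: float = 17.0) -> list[Cue]:
    """Return cues that exceed max_cps (an illustrative threshold)."""
    return [cue for cue in cues if chars_per_second(cue) > max_cps]


cues = [
    Cue(0.0, 2.0, "Hello there"),
    Cue(2.0, 3.0, "This line is far too long to read in one second"),
]
flagged = flag_fast_cues(cues)
```

Here only the second cue is flagged. A corrective step could then extend the cue's duration or shorten its text, but how V-SAT handles that is not described in the excerpt above.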
