TL;DR
This paper assesses a dual decoding approach that jointly generates captions and subtitles, increasing their adequacy and consistency with virtually no additional cost in model size or training complexity.
Contribution
It assesses a dual decoding scheme that tightly couples the captioning and subtitling tasks, increasing adequacy and consistency with virtually no increase in model size or training complexity.
Findings
Improved adequacy and consistency of captions and subtitles
Virtually no increase in model size or training complexity
Effective joint generation of captions and subtitles
Abstract
As the amount of audio-visual content increases, developing automatic captioning and subtitling solutions to match the expectations of a growing international audience appears to be the only viable way to boost throughput and lower the related post-production costs. Automatic captioning and subtitling often need to be tightly intertwined to achieve an appropriate level of consistency and synchronization with each other and with the video signal. In this work, we assess a dual decoding scheme to achieve a strong coupling between these two tasks and show how adequacy and consistency are increased, with virtually no additional cost in terms of model size and training complexity.
