Multi-language Video Subtitle Dataset for Image-based Text Recognition

Thanadol Singkhornart; Olarik Surinta

arXiv:2411.05043 · cs.CV · November 11, 2024

TL;DR

This paper introduces a multi-language video subtitle dataset of 4,224 subtitle images covering 157 unique characters, including Thai and Roman characters, set against complex backgrounds. It is intended to advance research on text recognition in video.

Contribution

It provides a dataset of 4,224 subtitle images extracted from 24 online videos and covering 157 unique characters, supporting the development of robust multilingual text recognition models.

Findings

01

Enhanced training data for multilingual OCR models

02

A resource for improving recognition of subtitle text against complex backgrounds

03

Facilitates research in deep learning for video text recognition

Abstract

The Multi-language Video Subtitle Dataset is a comprehensive collection designed to support research in text recognition across multiple languages. This dataset includes 4,224 subtitle images extracted from 24 videos sourced from online platforms. It features a wide variety of characters, including Thai consonants, vowels, tone marks, punctuation marks, numerals, Roman characters, and Arabic numerals. With 157 unique characters, the dataset provides a resource for addressing challenges in text recognition within complex backgrounds. It addresses the growing need for high-quality, multilingual text recognition data, particularly as videos with embedded subtitles become increasingly dominant on platforms like YouTube and Facebook. The variability in text length, font, and placement within these images adds complexity, offering a valuable resource for developing and evaluating deep…

Taxonomy

Topics: Handwritten Text Recognition Techniques