SDL: New data generation tools for full-level annotated document layout
Son Nguyen Truong

TL;DR
This paper introduces SDL, a data generation tool that creates richly annotated document images with detailed layout information, supporting low-resource languages and large-scale dataset creation.
Contribution
The paper presents a novel tool for generating fully annotated document images with detailed layout information. It is released together with a dataset of 320,000 Vietnamese synthetic document images and instructions for extending the approach to other languages.
Findings
Generated 320,000 Vietnamese synthetic document images.
Enabled large-scale dataset creation for low-resource languages.
Facilitated detailed layout annotation for document processing.
Abstract
We present a novel data generation tool for document processing. The tool focuses on providing the maximum level of visual information for an ordinary document, ranging from character-level to paragraph-level positions. It also makes it possible to work with large datasets in low-resource languages and provides a means of processing full-level information about the document text. The tool comes with a dataset of 320,000 Vietnamese synthetic document images and instructions for generating a dataset of similar size in other languages. The repository can be found at: https://github.com/tson1997/SDL-Document-Image-Generation
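To illustrate what annotation "from character position to paragraph-level position" can look like, here is a minimal Python sketch of a hierarchical layout annotation. It is not the SDL tool's actual output format: the `Node` class, the `union`, `chars`, and `group` helpers, the level names, and the pixel coordinates are all hypothetical, chosen only to show how word, line, and paragraph boxes can be derived from character boxes.

```python
import unicodedata
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
BBox = Tuple[int, int, int, int]


def union(boxes: List[BBox]) -> BBox:
    """Return the smallest box that encloses all the given boxes."""
    if not boxes:
        raise ValueError("cannot take the union of zero boxes")
    x0, y0, x1, y1 = zip(*boxes)
    return (min(x0), min(y0), max(x1), max(y1))


@dataclass
class Node:
    """One annotated unit at a hypothetical level of the layout hierarchy."""
    level: str  # "char", "word", "line" or "paragraph"
    text: str
    bbox: BBox
    children: List["Node"] = field(default_factory=list)


def chars(text: str, boxes: List[BBox]) -> List[Node]:
    """Pair each character with its box.

    The text is NFC-normalized first so that an accented Vietnamese
    letter such as 'à' is one code point and lines up with one box.
    """
    text = unicodedata.normalize("NFC", text)
    if len(text) != len(boxes):
        raise ValueError("need exactly one box per character")
    return [Node("char", ch, box) for ch, box in zip(text, boxes)]


def group(level: str, children: List[Node], sep: str) -> Node:
    """Build a parent node whose box is the union of its children's boxes."""
    return Node(
        level=level,
        text=sep.join(c.text for c in children),
        bbox=union([c.bbox for c in children]),
        children=children,
    )


# Made-up character boxes for the two words of "Xin chào".
word1 = group("word", chars("Xin", [(10, 10, 18, 24), (19, 10, 22, 24),
                                    (23, 14, 31, 24)]), sep="")
word2 = group("word", chars("chào", [(36, 14, 43, 24), (44, 10, 52, 24),
                                     (53, 9, 60, 24), (61, 14, 69, 24)]), sep="")
line = group("line", [word1, word2], sep=" ")
paragraph = group("paragraph", [line], sep="\n")
```

Because every parent box is computed from its children, a consumer can read the same structure at whichever granularity a downstream task needs, for example character boxes for text recognition or paragraph boxes for layout analysis.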
Taxonomy
Topics: Handwritten Text Recognition Techniques · Image Processing and 3D Reconstruction · Natural Language Processing Techniques
