TL;DR
This paper introduces RuREBus, a large corpus of Russian strategic planning documents created through a semi-automated annotation process, enabling new language technology applications and insights for e-government research.
Contribution
It presents a novel corpus creation pipeline combining machine learning and manual correction for Russian strategic planning texts.
Findings
Successful creation of a large annotated corpus
Demonstrated pipeline for semi-automated text annotation
Potential for new insights in e-government research
Abstract
In this paper we present RuREBus, a corpus of Russian strategic planning documents. The project is motivated by both language technology and e-government perspectives: we develop not only new language resources and tools, but also their applications to e-government research. We demonstrate a pipeline for creating a text corpus from scratch. First, the annotation schema is designed. Next, the texts are marked up using a human-in-the-loop strategy, in which preliminary annotations are produced by a machine learning model and then manually corrected. The amount of annotated text is large enough to showcase what insights can be gained from RuREBus.
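The human-in-the-loop step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `LexiconTagger` class is a toy stand-in for the machine learning model, the labels `SOC` and `INFRA` are placeholders rather than the RuREBus schema, `simulated_annotator` stands in for a human corrector, and refitting the model after every corrected document is an assumption that the abstract does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str  # placeholder label, not the RuREBus schema


class LexiconTagger:
    """Toy stand-in for a machine learning model: tags memorized phrases."""

    def __init__(self):
        self.lexicon = {}  # lowercased phrase -> label

    def fit(self, annotated):
        # "Training" here only memorizes every annotated phrase.
        for text, spans in annotated:
            for s in spans:
                self.lexicon[text[s.start:s.end].lower()] = s.label

    def predict(self, text):
        lowered = text.lower()
        spans = []
        for phrase, label in self.lexicon.items():
            pos = lowered.find(phrase)
            while pos != -1:
                spans.append(Span(pos, pos + len(phrase), label))
                pos = lowered.find(phrase, pos + 1)
        return sorted(spans, key=lambda s: s.start)


def annotate_with_human_in_the_loop(texts, model, correct, seed):
    """Pre-annotate each text, have it corrected, then refit the model."""
    gold = list(seed)
    model.fit(gold)
    for text in texts:
        draft = model.predict(text)   # preliminary machine annotation
        final = correct(text, draft)  # manual correction step
        gold.append((text, final))
        model.fit(gold)               # assumed: corrections feed back in
    return gold


def simulated_annotator(text, draft):
    """Hypothetical corrector: keeps the draft and adds a missed 'roads' span."""
    spans = list(draft)
    pos = text.find("roads")
    if pos != -1 and all(s.start != pos for s in spans):
        spans.append(Span(pos, pos + len("roads"), "INFRA"))
    return spans


# Invented English toy data; the real corpus is Russian.
seed = [("Build new schools by 2025.", [Span(6, 17, "SOC")])]
texts = ["Repair roads and open new schools.", "Regional roads need funding."]
gold = annotate_with_human_in_the_loop(
    texts, LexiconTagger(), simulated_annotator, seed
)
for text, spans in gold:
    print(text, [(text[s.start:s.end], s.label) for s in spans])
```

In this sketch the second document needs no manual additions, because the span corrected in the first document has already been absorbed by the model. That is the intended benefit of pre-annotation: the annotator reviews drafts rather than labeling from scratch.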
