Russian-Language Multimodal Dataset for Automatic Summarization of Scientific Papers
Alena Tsanda, Elena Bruches

TL;DR
This paper introduces a new multimodal dataset of Russian scientific papers, including texts, tables, and figures, and evaluates existing language models for automatic summarization.
Contribution
The paper presents a new multimodal dataset of Russian-language scientific papers and benchmarks two language models, Gigachat and YandexGPT, on the summarization task.
Findings
YandexGPT outperforms Gigachat in summarization quality.
The dataset enables multimodal scientific paper summarization research.
Models show varying effectiveness across different paper modalities.
Abstract
The paper describes the creation of a multimodal dataset of Russian-language scientific papers and the testing of existing language models on the task of automatic text summarization. The dataset's distinguishing feature is that it is multimodal: it includes texts, tables, and figures. The paper presents the results of experiments with two language models: Gigachat from SBER and YandexGPT from Yandex. The dataset consists of 420 papers and is publicly available at https://github.com/iis-research-team/summarization-dataset.
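To make the idea of a multimodal paper record concrete, here is a minimal Python sketch of how one entry combining text, tables, and figures might be represented. The class name, field names, and the `modalities` helper are all hypothetical and chosen for illustration; the actual layout of the published dataset may differ and should be checked in the repository.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class PaperRecord:
    """Hypothetical record for one paper; field names are illustrative only."""
    paper_id: str
    text: str                                          # body text of the paper
    abstract: str                                      # assumed to serve as a reference summary
    tables: List[str] = field(default_factory=list)    # e.g. serialized tables
    figures: List[str] = field(default_factory=list)   # e.g. paths to figure images


def modalities(record: PaperRecord) -> Set[str]:
    """Return the set of modalities that are non-empty in a record."""
    present: Set[str] = set()
    if record.text:
        present.add("text")
    if record.tables:
        present.add("tables")
    if record.figures:
        present.add("figures")
    return present
```

A helper like `modalities` could be used to group papers by the inputs available before comparing summaries across groups; this is only one possible design and not a description of the authors' pipeline.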
Taxonomy
Topics: Topic Modeling · Advanced Text Analysis Techniques · Natural Language Processing Techniques
