Russian-Language Multimodal Dataset for Automatic Summarization of Scientific Papers

Alena Tsanda; Elena Bruches

arXiv:2405.07886·cs.CL·May 14, 2024

TL;DR

This paper introduces a new multimodal dataset of Russian scientific papers, including texts, tables, and figures, and evaluates existing language models for automatic summarization.

Contribution

The paper presents a novel multimodal dataset of Russian-language scientific papers and benchmarks two language models, Gigachat and YandexGPT, on the summarization task.

Findings

01

YandexGPT outperforms Gigachat in summarization quality.

02

The dataset enables multimodal scientific paper summarization research.

03

Models show varying effectiveness across different paper modalities.

Abstract

The paper describes the creation of a multimodal dataset of Russian-language scientific papers and the testing of existing language models on the task of automatic text summarization. The dataset is distinguished by its multimodal data, which includes texts, tables, and figures. The paper presents the results of experiments with two language models: Gigachat from SBER and YandexGPT from Yandex. The dataset consists of 420 papers and is publicly available at https://github.com/iis-research-team/summarization-dataset.

Code & Models

Repositories

iis-research-team/summarization-dataset (official)

Taxonomy

Topics: Topic Modeling · Advanced Text Analysis Techniques · Natural Language Processing Techniques