StRuCom: A Novel Dataset of Structured Code Comments in Russian

Maria Dziuba; Valentin Malykh

arXiv:2505.11026 · cs.CL · May 19, 2025

TL;DR

This paper introduces StRuCom, a large-scale dataset of structured Russian code comments, and shows that fine-tuning models on it yields statistically significant improvements in Russian code comment generation over baseline models.

Contribution

The paper presents the first large-scale Russian code comment dataset, StRuCom, and shows that fine-tuning models on it enhances generation performance over baselines.

Findings

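01. Fine-tuning Qwen2.5-Coder models (0.5B-7B) on StRuCom yields statistically significant improvements in chrF++ and BERTScore over baseline models.

02. Unlike machine-translated English datasets, StRuCom preserves technical terminology and docstring structure.

03. Automated validation ensures that comments comply with the docstring standards of Python, Java, JavaScript, C#, and Go.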

Abstract

Structured code comments in docstring format are essential for code comprehension and maintenance, but existing machine learning models for their generation perform worse for Russian than for English. To bridge this gap, we present StRuCom, the first large-scale dataset (153K examples) specifically designed for Russian code documentation. Unlike machine-translated English datasets that distort terminology (e.g., technical loanwords vs. literal translations) and docstring structures, StRuCom combines human-written comments from Russian GitHub repositories with synthetically generated ones, ensuring compliance with Python, Java, JavaScript, C#, and Go standards through automated validation. Fine-tuning Qwen2.5-Coder models (0.5B-7B) on StRuCom shows statistically significant improvements in chrF++ and BERTScore over baseline models.
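The abstract does not say how the automated validation works. As an illustration only, the sketch below shows one way a structural check could look for Python, assuming Google-style docstrings with an "Args:" section. The function names `documented_params` and `validate` and the sample snippet are hypothetical and are not taken from the paper.

```python
import ast
import re


def documented_params(docstring: str) -> set[str]:
    """Collect parameter names listed under a Google-style 'Args:' section."""
    names = set()
    in_args = False
    for line in docstring.splitlines():
        stripped = line.strip()
        if stripped == "Args:":
            in_args = True
            continue
        if in_args:
            # A new section header such as 'Returns:' ends the Args block.
            if re.fullmatch(r"[A-Z][A-Za-z ]*:", stripped):
                break
            # Matches entries like 'a (int): ...' or 'a: ...'.
            match = re.match(r"(\w+)\s*(\([^)]*\))?:", stripped)
            if match:
                names.add(match.group(1))
    return names


def validate(source: str) -> list[str]:
    """Return a list of problems found in the functions of the given source."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        doc = ast.get_docstring(node)
        if doc is None:
            problems.append(f"{node.name}: missing docstring")
            continue
        # Only positional parameters are checked in this simplified sketch.
        expected = {a.arg for a in node.args.args if a.arg not in ("self", "cls")}
        for name in sorted(expected - documented_params(doc)):
            problems.append(f"{node.name}: parameter '{name}' is not documented")
    return problems


# Hypothetical sample: a Russian docstring that forgets to document 'b'.
# ("Adds two numbers. / a: first addend. / Returns the sum.")
SAMPLE = '''
def add(a, b):
    """Складывает два числа.

    Args:
        a (int): Первое слагаемое.

    Returns:
        int: Сумма.
    """
    return a + b
'''

if __name__ == "__main__":
    for problem in validate(SAMPLE):
        print(problem)
```

A check like this covers only one docstring convention for one language. A validator for the five languages named in the abstract would need a separate rule set for each format, and the line-based parsing here would misread description lines that happen to contain a colon.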

Taxonomy

Topics: Natural Language Processing Techniques · Software Engineering Research · Topic Modeling