StRuCom: A Novel Dataset of Structured Code Comments in Russian
Maria Dziuba, Valentin Malykh

TL;DR
This paper introduces StRuCom, a large-scale dataset (153K examples) of structured Russian code comments, and shows that fine-tuning Qwen2.5-Coder models on it yields statistically significant gains in Russian code comment generation over baseline models.
Contribution
The paper presents the first large-scale Russian code comment dataset, StRuCom, and shows that fine-tuning models on it enhances generation performance over baselines.
Findings
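Fine-tuning Qwen2.5-Coder models (0.5B-7B) on StRuCom yields statistically significant improvements in chrF++ and BERTScore over baseline models.
Unlike machine-translated English datasets, which distort terminology and docstring structure, StRuCom preserves both.
Automated validation ensures that comments comply with the docstring standards of Python, Java, JavaScript, C#, and Go.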
Abstract
Structured code comments in docstring format are essential for code comprehension and maintenance, but existing machine learning models for their generation perform poorly for Russian compared to English. To bridge this gap, we present StRuCom, the first large-scale dataset (153K examples) specifically designed for Russian code documentation. Unlike machine-translated English datasets that distort terminology (e.g., technical loanwords vs. literal translations) and docstring structures, StRuCom combines human-written comments from Russian GitHub repositories with synthetically generated ones, ensuring compliance with Python, Java, JavaScript, C#, and Go standards through automated validation. Fine-tuning Qwen2.5-Coder models (0.5B-7B) on StRuCom shows statistically significant improvements in chrF++ and BERTScore over baseline models.
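To make the idea of automated docstring validation concrete, here is a minimal sketch for the Python case only. It is not the authors' validator, which the abstract does not describe. It assumes Google-style docstrings with an "Args:" section, and the function name undocumented_params and the SAMPLE snippet are hypothetical names introduced for illustration. The check reports function parameters that the docstring fails to document.

```python
import ast
import re


def undocumented_params(source: str) -> dict[str, list[str]]:
    """Map each function name to the parameters its docstring does not document.

    Assumption: docstrings follow the Google style, where parameters are
    listed under an "Args:" section as "name: ..." or "name (type): ...".
    """
    problems = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        doc = ast.get_docstring(node) or ""
        documented = set()
        in_args = False
        for line in doc.splitlines():
            stripped = line.strip()
            if stripped == "Args:":
                in_args = True
                continue
            if in_args:
                # A new section header such as "Returns:" ends the Args block.
                if re.fullmatch(r"[A-Z][A-Za-z ]*:", stripped):
                    in_args = False
                    continue
                match = re.match(r"(\w+)\s*(\([^)]*\))?:", stripped)
                if match:
                    documented.add(match.group(1))
        all_args = node.args.posonlyargs + node.args.args + node.args.kwonlyargs
        params = [a.arg for a in all_args if a.arg not in ("self", "cls")]
        missing = [p for p in params if p not in documented]
        if missing:
            problems[node.name] = missing
    return problems


# Hypothetical sample: a Russian docstring that forgets to document "factor".
SAMPLE = '''
def scale(value, factor):
    """Умножает значение на коэффициент.

    Args:
        value (float): Исходное значение.

    Returns:
        float: Результат умножения.
    """
    return value * factor
'''

print(undocumented_params(SAMPLE))  # {'scale': ['factor']}
```

A real pipeline covering all five languages would need a separate structural check per documentation convention; this sketch only illustrates the kind of rule such a validator might enforce.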
Taxonomy
Topics: Natural Language Processing Techniques · Software Engineering Research · Topic Modeling
