DocuMint: Docstring Generation for Python using Small Language Models

Bibek Poudel; Adam Cook; Sekou Traore; Shelah Ameli

arXiv:2405.10243·cs.SE·May 17, 2024·1 cites

TL;DR

This paper evaluates small language models for generating high-quality Python docstrings, introduces a large dataset called DocuMint for fine-tuning, and demonstrates performance improvements through fine-tuning and benchmarking.

Contribution

The paper introduces DocuMint, a supervised fine-tuning dataset of 100,000 samples for docstring generation, and evaluates how effectively small language models generate Python docstrings.

Findings

01. Llama 3 8B achieved the best quantitative scores.
02. CodeGemma 7B scored highest in human evaluation.
03. Fine-tuning with DocuMint improves model performance significantly.

Abstract

Effective communication, specifically through documentation, is the beating heart of collaboration among contributors in software development. Recent advancements in language models (LMs) have enabled the introduction of a new type of actor in that ecosystem: LM-powered assistants capable of code generation, optimization, and maintenance. Our study investigates the efficacy of small language models (SLMs) for generating high-quality docstrings by assessing accuracy, conciseness, and clarity, benchmarking performance quantitatively through mathematical formulas and qualitatively through human evaluation using a Likert scale. Further, we introduce DocuMint, a large-scale supervised fine-tuning dataset with 100,000 samples. In quantitative experiments, Llama 3 8B achieved the best performance across all metrics, with conciseness and clarity scores of 0.605 and 64.88, respectively.…
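The abstract says conciseness and clarity are scored "through mathematical formulas" but this page does not give them. Purely as an illustration of how such scores could be computed for a docstring, the sketch below uses two common stand-ins: a zlib compression ratio as a conciseness proxy and the Flesch Reading Ease formula as a clarity proxy. Both choices, and the helper names `compression_ratio`, `count_syllables`, and `flesch_reading_ease`, are assumptions made for this example and may not match the paper's definitions.

```python
import re
import zlib


def compression_ratio(text: str) -> float:
    """Compressed size / raw size. Assumed proxy: redundant text compresses more."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must be non-empty")
    return len(zlib.compress(raw)) / len(raw)


def count_syllables(word: str) -> int:
    """Naive heuristic: count runs of vowels, with a minimum of one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """Standard Flesch Reading Ease; higher scores mean easier reading."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


if __name__ == "__main__":
    # Hypothetical docstring used only to exercise the two functions.
    docstring = "Return the sum of two numbers. Raise an error if either is missing."
    print(f"compression ratio: {compression_ratio(docstring):.3f}")
    print(f"reading ease:      {flesch_reading_ease(docstring):.2f}")
```

The syllable counter is deliberately crude, so absolute readability values from this sketch are not comparable to the scores reported in the abstract.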

Code & Models

Repositories

docu-mint/documint (Official)

Models

documint/google-codegemma-2b-documint
model · 15 dl · ♡ 4

Datasets

documint/DocuMint
dataset · 36 dl

Taxonomy

Topics: Mathematics, Computing, and Information Processing

Methods: LLaMA