Semantic Source Code Segmentation using Small and Large Language Models
Abdelhalim Dahou, Ansgar Scherp, Sebastian Kurten, Brigitte Mathiak, Madhu Chauhan

TL;DR
This paper presents automated methods for segmenting R and Python source code into coherent parts using small and large language models, demonstrating that smaller models fine-tuned on domain-specific data outperform larger pre-trained models.
Contribution
It introduces two novel domain-specific approaches for code segmentation, along with a new annotated dataset, and compares their effectiveness across different language models.
Findings
Context-based line-by-line analysis outperforms range-based segmentation.
Smaller models like CodeBERT and CodeT5+ are more effective than larger LLMs.
Fine-tuning small models on limited annotated data yields superior results.
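The difference between the two formulations compared above can be sketched in a few lines of Python. This is an illustrative sketch only, not the paper's implementation: the function names, the window size `k`, and the segment labels in the usage note are all hypothetical. The first helper builds the per-line context that a line-by-line classifier might be given; the second collapses per-line labels into the kind of line ranges a range-based method would output directly.

```python
from typing import List, Tuple


def context_windows(lines: List[str], k: int = 2) -> List[Tuple[List[str], str, List[str]]]:
    """For each line, return (up to k preceding lines, the target line, up to k following lines).

    Hypothetical helper: the window size k is an assumption, not a value from the paper.
    """
    windows = []
    for i, line in enumerate(lines):
        before = lines[max(0, i - k):i]
        after = lines[i + 1:i + 1 + k]
        windows.append((before, line, after))
    return windows


def labels_to_ranges(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Collapse per-line labels into 1-indexed, inclusive (start, end, label) ranges."""
    ranges = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of input or when the label changes.
        if i == len(labels) or labels[i] != labels[start]:
            ranges.append((start + 1, i, labels[start]))
            start = i
    return ranges
```

For example, with made-up labels, `labels_to_ranges(["load", "load", "model", "plot", "plot"])` yields three segments covering lines 1-2, 3, and 4-5. A line-by-line method predicts the label list and derives the ranges this way, whereas a range-based method has to produce the boundaries in one step.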
Abstract
Source code segmentation, dividing code into functionally coherent segments, is crucial for knowledge retrieval and maintenance in software development. While segmentation enables efficient navigation and comprehension of large codebases, manual and syntactic analysis approaches have become impractical as repositories grow, especially for low-resource languages like R and their research domains (e.g., social sciences, psychology). This paper introduces an automated, domain-specific approach for research R code segmentation using Large and Small Language Models (LLMs/SLMs). It presents two novel approaches and a human-annotated dataset, StatCodeSeg. We explore two distinct approaches: line-by-line analysis with context and range-based segment determination. We experiment with LLMs and fine-tuned SLMs. To support the generalizability of our approaches, we also include experiments on Python code from…
