Semantic Source Code Segmentation using Small and Large Language Models

Abdelhalim Dahou; Ansgar Scherp; Sebastian Kurten; Brigitte Mathiak; Madhu Chauhan

arXiv:2507.08992·cs.SE·July 15, 2025

TL;DR

This paper presents automated methods for segmenting R and Python source code into coherent parts using small and large language models, demonstrating that smaller models fine-tuned on domain-specific data outperform larger pre-trained models.

Contribution

The paper introduces two novel domain-specific approaches for code segmentation and a new annotated dataset, and compares the approaches' effectiveness across different language models.

Findings

01. Context-based line-by-line analysis outperforms range-based segmentation.
02. Smaller models like CodeBERT and CodeT5+ are more effective than larger LLMs.
03. Fine-tuning small models on limited annotated data yields superior results.

Abstract

Source code segmentation, dividing code into functionally coherent segments, is crucial for knowledge retrieval and maintenance in software development. While segmentation enables efficient navigation and comprehension of large codebases, manual and syntactic analysis approaches have become impractical as repositories grow, especially for low-resource languages like R and their research domains (e.g., social sciences, psychology). This paper introduces an automated, domain-specific approach for research R code segmentation using Large and Small Language Models (LLMs/SLMs). It presents two novel approaches and a human-annotated dataset, StatCodeSeg. We explore two distinct approaches: line-by-line analysis with context and range-based segment determination. We experiment with LLMs and fine-tuned SLMs. To support the generalizability of our approaches, we also include experiments on Python code from…
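The two formulations named in the abstract can be pictured with a minimal sketch. It is not the authors' implementation: the abstract does not give the label scheme, the context size, or the output format, so the function names (`context_windows`, `labels_to_ranges`, `ranges_to_labels`), the window size `k`, and the boolean "starts a new segment" label are all assumptions made for illustration. In the line-by-line view, a model would see each line together with its neighbours and decide whether it opens a new segment; in the range-based view, a model would output (start, end) line ranges directly. The sketch shows only the data shapes and how one maps onto the other.

```python
from typing import List, Tuple


def context_windows(lines: List[str], k: int = 2) -> List[Tuple[List[str], str, List[str]]]:
    """For each line, return (up to k preceding lines, the line, up to k following lines).

    Assumption: a symmetric window of k lines; the paper's context may differ.
    """
    windows = []
    for i, line in enumerate(lines):
        before = lines[max(0, i - k):i]
        after = lines[i + 1:i + 1 + k]
        windows.append((before, line, after))
    return windows


def labels_to_ranges(starts: List[bool]) -> List[Tuple[int, int]]:
    """Turn per-line 'starts a new segment' flags into inclusive, 1-based ranges."""
    ranges = []
    seg_start = 1
    for i, flag in enumerate(starts, start=1):
        if flag and i > 1:
            # A new segment opens here, so the previous one ends on the line before.
            ranges.append((seg_start, i - 1))
            seg_start = i
    if starts:
        ranges.append((seg_start, len(starts)))
    return ranges


def ranges_to_labels(ranges: List[Tuple[int, int]], n_lines: int) -> List[bool]:
    """Turn inclusive, 1-based ranges back into per-line 'starts a new segment' flags."""
    flags = [False] * n_lines
    for start, _end in ranges:
        flags[start - 1] = True
    return flags


# Hypothetical R script, one statement per line.
r_code = [
    "library(dplyr)",
    "df <- read.csv('data.csv')",
    "df <- filter(df, age > 18)",
    "model <- lm(score ~ age, data = df)",
    "summary(model)",
]

# Hand-written flags standing in for a classifier's per-line predictions.
predicted_starts = [True, False, False, True, False]

windows = context_windows(r_code, k=1)
segments = labels_to_ranges(predicted_starts)
```

Here `segments` groups lines 1 to 3 (loading and filtering) and lines 4 to 5 (modelling) into two ranges, and `ranges_to_labels(segments, len(r_code))` recovers the original flags, so the two representations carry the same information even though a model may find one easier to predict than the other.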
