SWE-Spot: Building Small Repo-Experts with Repository-Centric Learning
Jinjun Peng, Magnus Saebo, Tianjun Zhong, Yi-Jie Cheng, Junfeng Yang, Baishakhi Ray, Simin Chen, Yangruibo Ding

TL;DR
This paper introduces Repository-Centric Learning (RCL), a paradigm for training small language models to specialize in a specific code repository. Compared with traditional task-centric training, it improves task performance, training sample efficiency, and inference cost.
Contribution
The paper proposes RCL as a paradigm shift: a repository-focused training approach that prioritizes depth in one target repository over breadth across many, so that small language models understand and generalize within that specific software environment.
Findings
SWE-Spot-4B models outperform larger models on multiple SWE tasks.
RCL improves training sample efficiency and reduces inference costs.
Repository mastery complements general coding capabilities.
Abstract
The deployment of coding agents in privacy-sensitive and resource-constrained environments drives the demand for capable open-weight Small Language Models (SLMs). However, they suffer from a fundamental capability gap: unlike frontier large models, they lack the inference-time strong generalization to work with complicated, unfamiliar codebases. We identify that the prevailing Task-Centric Learning (TCL) paradigm, which scales exposure across disparate repositories, fails to address this limitation. In response, we propose Repository-Centric Learning (RCL), a paradigm shift that prioritizes vertical repository depth over horizontal task breadth, suggesting SLMs must internalize the "physics" of a target software environment through parametric knowledge acquisition, rather than attempting to recover it via costly inference-time search. Following this new paradigm, we design a four-unit…
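The contrast between TCL and RCL can be pictured as two ways of spending the same training budget. The sketch below is illustrative only and is not the paper's method: the function name `allocate_budget`, the repository names, and the even-split rule are all assumptions made for this example, and the paper's actual training design is not reproduced here.

```python
def allocate_budget(repos, budget, target=None):
    """Split a fixed number of training samples across repositories.

    Hypothetical helper for illustration; not taken from the paper.
    target=None   -> breadth-first (TCL-like): spread samples evenly.
    target="name" -> depth-first (RCL-like): spend every sample on one repo.
    """
    if not repos:
        raise ValueError("need at least one repository")
    if budget < 0:
        raise ValueError("budget must be non-negative")

    if target is None:
        # Even split; the first `extra` repos absorb the remainder.
        base, extra = divmod(budget, len(repos))
        return {r: base + (1 if i < extra else 0) for i, r in enumerate(repos)}

    if target not in repos:
        raise ValueError(f"unknown target repository: {target}")
    # All samples go to the target; every other repo gets none.
    return {r: (budget if r == target else 0) for r in repos}


# Made-up repository names, used only to show the two allocations.
repos = ["repo_a", "repo_b", "repo_c", "repo_d"]
tcl_like = allocate_budget(repos, 1000)                   # breadth across repos
rcl_like = allocate_budget(repos, 1000, target="repo_a")  # depth in one repo
```

With the same budget of 1000 samples, the breadth-first call gives each of the four repositories 250 samples, while the depth-first call gives all 1000 to `repo_a`. That is the "vertical depth over horizontal breadth" trade-off in its simplest form; how the samples for the target repository are actually constructed is a separate question this sketch does not address.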
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Privacy-Preserving Technologies in Data · Domain Adaptation and Few-Shot Learning
