A systematic mapping study of developer social network research
Steffen Herbold, Aynur Amirfallah, Fabian Trautsch, Jens Grabowski

TL;DR
This systematic mapping study reviews 255 research papers on developer social networks, highlighting research trends, common topics, data limitations, and open issues in the field of software engineering.
Contribution
It provides a comprehensive overview of DSN research, mapping primary studies to research directions and identifying gaps and challenges in current methodologies.
Findings
Nearly half of studies focus on community structure.
Many studies use small sample sizes, affecting validity.
Open issues include inter-company collaboration and data source diversity.
Abstract
Developer social networks (DSNs) are a tool for the analysis of community structures and collaborations between developers in software projects and software ecosystems. Within this paper, we present the results of a systematic mapping study on the use of DSNs in software engineering research. We identified 255 primary studies on DSNs. We mapped the primary studies to research directions, collected information about the data sources and the size of the studies, and conducted a bibliometric assessment. We found that nearly half of the research investigates the structure of developer communities. Other frequent topics are prediction systems built using DSNs, collaboration behavior between developers, and the roles of developers. Moreover, we determined that many publications use a small sample size regarding the number of projects, which could be problematic for the external validity of…
| Search terms | Google Scholar | IEEE Xplore | ACM Digital Library | Springer Link | Elsevier Search | Scopus |
|---|---|---|---|---|---|---|
| developers network | 969,000 | 4,339 | 204,258 | 108,157 | 60,735 | 102 |
| developer social networks | 235,000 | 513 | 249,607 | 48,021 | 26,642 | 131 |
| collaborative networks OSS | 25,400 | 22 | 119,424 | 1,090 | 692 | 0 |
| Category | #Pubs. | Publications |
|---|---|---|
| Community Structure | ||
| General | 75 | \citeSallaho2013analyzing,amrit2004social,antwerp2010importance,avelino2019measuring,bana2018influence,batista2017collaboration,behfar2016intragroup,behfar2018knowledge,bidoki2018network,bird2006mining,bird2006mining1,bird2008latent,canfora2011social,cherry2008social,conaldi2010meso,crowston2005social,crowston2006hierarchy,dos2011bringing,gao2007network,gao2007towards,geipel2014communication,gloor2003visualization,gonzalez2004community,guilherme2017assessing,he2012applying,howison2006social,hu2008comparison,hu2012reputation,hu2016influence,huang2011relating,ichimura2015analysis,iyer2019requirements,jermakovics2011mining,jermakovics2013exploring,jiang2013understanding,joblin2015developer,kamei2008analysis,kidane2007correlating,leibzon2016social,lim2011evolving,lima2014coding,linaaker2019method,long2007social,lopez2004applying,lopez2008applying,madey2002open,meneely2009secure,meneely2010strengthening,meneely2010use,mergel2015open,nzeko2015social,qiu2019going,robertsa2006communication,schwind2008unveiling,singh2010small,sowe2014empirical,sureka2011using,surian2010mining,tamburri2019discovering,tan2007social,thung2013network,toral2010analysis,van2010importance,wagstrom2005social,wang2019investigating,wiggins2008social,wolf2009mining,xu2004exploration,xu2005open,xu2005topological,yu2013study,yu2014exploring,zanetti2012quantitative,zhang2014generative,zhang2015analyzing |
| DSN Evolution | 18 | \citeSaljemabi2018empirical,datta2011evolution,hannemann2013community,hong2011understanding,joblin2017evolutionary,kakimoto2006social,kavaler2017stochastic,kumar2013evolution,kumar2019studying,nakakoji2005understanding,ngamkajornwiwat2008exploratory,ryan2010modeling,sharma2011studying,van2010open,weiss2006evolution,yu2014exploring,zanetti2013rise,zhang2011network |
| Global SWE | 10 | \citeSahuja2003individual,avritzer2010coordination,cataldo2008communication1,de2007toward,ehrlich2006leveraging,ehrlich2012all,hinds2006structures,hossain2009social,sarker2011path,spinellis2006global |
| Team Formation | 6 | \citeScaglayan2013emergence,crowston2007self,hahn2006impact,hahn2008emergence,panichella2014evolution,singh2010developer |
| Impact on Code Quality | 6 | \citeSbettenburg2010studying,bettenburg2013studying,ccaglayan2016effect,datta2018does,hossain2008measuring,mockus2010organizational |
| Socio-technical Congruence | 5 | \citeScataldo2008socio,de2005seeking,kwan2011does,syeed2013socio,valetto2007using |
| Simulation | 4 | \citeShonsel2014software,honsel2015developer,honsel2015mining,yu2008mining |
| Community Smells | 2 | \citeScatolino2019gender,palomba2018beyond |
| Prediction | ||
| Bug Triage | 16 | \citeSbanitaan2013decoba,bhattacharya2010fine,bhattacharya2012automated,chen2010improving,jeong2009improving,sun2017enhancing,wang2013devnet,wu2011drex,wu2018empirical,xuan2012developer,yang2014utilizing,zanetti2013categorizing,zhang2012automated,zhang2013heterogeneous,zhang2014butter,zhang2014novel |
| Defect Prediction | 12 | \citeSabreu2009developer,bhattacharya2012graph,biccer2011defect,bird2009putting,hu2013using,meneely2008predicting,miranskyy2014effect,pinzger2008can,simpson2013changeset,wang2016analyzing,zhang2014mining,zhang2019file |
| Project Outcomes | 9 | \citeScataldo2012impact,jarczyk2018surgical,liu2007design,peng2018co,peng2019co,singh2011network,surian2013predicting,wang2012human,wang2012survival |
| Developers for Tasks in General | 7 | \citeSdravzdilova2012method,hossain2006actor,hu2018user,li2016task,mcdonald2003recommending,surian2011recommending,wan2018scsminer |
| Suitable Web Services | 3 | \citeSbianchini2015developers,bianchini2016role,bianchini2016social |
| Build Failures | 2 | \citeSschroter2010predicting,wolf2009predicting |
| Developers for Code Review | 1 | \citeSkerzazi2016can |
| Collaboration Behavior | ||
| General | 13 | \citeScohen2018large,damian2013role,duc2011impact,feczak2009measuring,feczak2011exploring,gharehyazie2017tracing,kerzazi2017knowledge,licorish2017exploring,omoronyia2009using,ortu2015measuring,wu2016effects,xuan2012measuring,yang2014social |
| Global SWE | 10 | \citeScataldo2008communication,chang2007out,damian2007collaboration,fonseca2006exploring,herbsleb2003empirical,mikawa2009removing,nguyen2008global,sarker2011role,urdangarin2008experiences,wolf2008does |
| Problems | 10 | \citeSbegel2010codebook,bernardi2012developers,bhowmik2016optimal,cataldo2006identification,damian2007awareness,ell2013identifying,orsila2009trust,Sapkota2020,wang2016diffusion,xuan2016converging |
| Inter-company collaboration behavior | 1 | \citeSteixeira2015lessons |
| Developer Roles | ||
| Identification | 18 | \citeSzhang2012empirical,meneely2010improving,bhattacharya2014determining,crowston2006core,datta2010social,dittrich2013network,huang2005mining,joblin2017classifying,lee2013github,licorish2014understanding,licorish2015communication,lim2010stakenet,lim2012stakerare,marczak2008information,pohl2008dynamic,sharma2017boundary,sowe2006identifying,yu2007mining |
| Onboarding | 9 | \citeSbird2007open,canfora2012who,casalnuovo2015developer,cheng2017developer,ducheneaut2005socialization,el2017periphery,gharehyazie2013social,gharehyazie2015developer,zhou2011does |
| Specialization | 1 | \citeSmaclean2011knowledge |
| Tools | 11 | \citeSborici2012proxiscientia,de2004technical,de2007supporting,gilbert2007codesaw,gote2019git,ogawa2007visualizing,ohira2005accelerating,ohira2005supporting,sarma2009tesseract,schwind2008svnnat,schwind2010tool |
| DSN Validity | 5 | \citeSmeneely2011socio,aljemabi2017empirical,nia2010validity,panichella2014developers,tymchuk2014collaboration |
| Datasets | 1 | \citeSmaclean2013apache |
| Data Source | #Pubs. | Publications |
|---|---|---|
| Forge | 64 | \citeSaljemabi2018empirical,allaho2013analyzing,bana2018influence,batista2017collaboration,behfar2016intragroup,behfar2018knowledge,bidoki2018network,caglayan2013emergence,catolino2019gender,cohen2018large,conaldi2010meso,dos2011bringing,dravzdilova2012method,el2017periphery,gao2007network,gao2007towards,hahn2006impact,hahn2008emergence,he2012applying,hu2016influence,hu2018user,huang2011relating,ichimura2015analysis,iyer2019requirements,jarczyk2018surgical,jiang2013understanding,kerzazi2016can,kerzazi2017knowledge,lee2013github,leibzon2016social,li2016task,lima2014coding,liu2007design,madey2002open,mergel2015open,ohira2005accelerating,ohira2005supporting,peng2018co,peng2019co,qiu2019going,Sapkota2020,singh2010small,singh2011network,surian2010mining,surian2011recommending,surian2013predicting,tamburri2019discovering,tan2007social,thung2013network,tymchuk2014collaboration,van2010importance,wan2018scsminer,wang2012human,wang2012survival,wang2019investigating,wu2016effects,xu2004exploration,xu2005open,xu2005topological,yu2013study,yu2014exploring,yu2014exploring,zhang2014generative,zhang2015analyzing |
| ITS | 49 | \citeSabreu2009developer,banitaan2013decoba,bettenburg2010studying,bhattacharya2010fine,bhattacharya2012automated,bhowmik2016optimal,cataldo2006identification,cataldo2008socio,cataldo2012impact,chen2010improving,crowston2005social,crowston2006core,crowston2006hierarchy,datta2010social,duc2011impact,ehrlich2012all,feczak2009measuring,feczak2011exploring,hong2011understanding,hossain2008measuring,hossain2009social,howison2006social,jeong2009improving,kumar2013evolution,kumar2019studying,licorish2014understanding,licorish2015communication,licorish2017exploring,long2007social,nguyen2008global,ortu2015measuring,sharma2011studying,sureka2011using,wang2013devnet,wolf2008does,wolf2009mining,wolf2009predicting,wu2011drex,wu2018empirical,xuan2012developer,yang2014utilizing,zanetti2012quantitative,zanetti2013categorizing,zanetti2013rise,zhang2012automated,zhang2013heterogeneous,zhang2014butter,zhang2014novel,zhou2011does |
| VCS | 41 | \citeSantwerp2010importance,avelino2019measuring,bird2009putting,casalnuovo2015developer,ccaglayan2016effect,cheng2017developer,de2004technical,de2005seeking,dittrich2013network,ell2013identifying,gonzalez2004community,gote2019git,guilherme2017assessing,huang2005mining,jermakovics2011mining,jermakovics2013exploring,joblin2015developer,joblin2017classifying,joblin2017evolutionary,kakimoto2006social,lopez2004applying,lopez2008applying,maclean2013apache,meneely2008predicting,meneely2009secure,meneely2010strengthening,meneely2011socio,miranskyy2014effect,mockus2010organizational,orsila2009trust,palomba2018beyond,pinzger2008can,pohl2008dynamic,schwind2008svnnat,schwind2008unveiling,schwind2010tool,sun2017enhancing,teixeira2015lessons,valetto2007using,van2010open,yu2007mining |
| ML | 23 | \citeSahuja2003individual,bird2006mining,bird2006mining1,bird2008latent,gharehyazie2015developer,gloor2003visualization,hossain2006actor,kamei2008analysis,kavaler2017stochastic,kidane2007correlating,nakakoji2005understanding,ngamkajornwiwat2008exploratory,nia2010validity,nzeko2015social,robertsa2006communication,sharma2017boundary,sowe2006identifying,toral2010analysis,weiss2006evolution,xuan2016converging,yu2008mining,zhang2012empirical,zhang2014mining |
| Other | 14 | \citeSamrit2004social,bianchini2015developers,bianchini2016role,bianchini2016social,borici2012proxiscientia,cataldo2008communication,damian2007awareness,de2007supporting,hu2008comparison,hu2012reputation,hu2013using,omoronyia2009using,wang2016analyzing,yang2014social |
| Survey | 13 | \citeSavritzer2010coordination,chang2007out,cherry2008social,ehrlich2006leveraging,hinds2006structures,lim2010stakenet,lim2011evolving,lim2012stakerare,mcdonald2003recommending,mikawa2009removing,sarker2011path,sarker2011role,urdangarin2008experiences |
| ITS & VCS | 19 | \citeSaljemabi2017empirical,bernardi2012developers,bettenburg2013studying,bhattacharya2012graph,bhattacharya2014determining,biccer2011defect,datta2011evolution,datta2018does,honsel2014software,honsel2015developer,honsel2015mining,kwan2011does,linaaker2019method,meneely2010improving,meneely2010use,schroter2010predicting,simpson2013changeset,spinellis2006global,zhang2019file |
| ML & VCS | 14 | \citeSbird2007open,canfora2011social,canfora2012who,ducheneaut2005socialization,fonseca2006exploring,gharehyazie2017tracing,gilbert2007codesaw,hannemann2013community,ogawa2007visualizing,singh2010developer,sowe2014empirical,syeed2013socio,xuan2012measuring,zhang2011network |
| Survey & Other | 3 | \citeSdamian2007collaboration,damian2013role,marczak2008information |
| ML & ITS | 2 | \citeSmaclean2011knowledge,ryan2010modeling |
| ML & Other | 2 | \citeScrowston2007self,wagstrom2005social |
| Forge & Survey | 1 | \citeSde2007toward |
| ITS & Survey | 1 | \citeSherbsleb2003empirical |
| VCS & Forge | 1 | \citeSgeipel2014communication |
| ITS, ML & VCS | 3 | \citeSgharehyazie2013social,panichella2014evolution,sarma2009tesseract |
| ML, ITS & Other | 1 | \citeSwiggins2008social |
| ITS, Survey & Other | 1 | \citeScataldo2008communication1 |
| ITS, VCS & Other | 1 | \citeSbegel2010codebook |
| ITS, ML, VCS & Other | 2 | \citeSpanichella2014developers,wang2016diffusion |
| #Projects | #Pubs. | Publications |
|---|---|---|
| 1 | 76 | \citeSabreu2009developer,ahuja2003individual,avritzer2010coordination,bettenburg2010studying,bird2006mining,bird2006mining1,caglayan2013emergence,cataldo2006identification,cataldo2008communication,cataldo2008socio,cataldo2012impact,cherry2008social,damian2007awareness,datta2010social,datta2011evolution,datta2018does,ducheneaut2005socialization,ehrlich2012all,ell2013identifying,gonzalez2004community,he2012applying,hong2011understanding,honsel2014software,honsel2015developer,hossain2006actor,hu2008comparison,kamei2008analysis,kumar2013evolution,kwan2011does,li2016task,licorish2014understanding,licorish2015communication,licorish2017exploring,lim2010stakenet,lim2011evolving,lim2012stakerare,linaaker2019method,maclean2011knowledge,marczak2008information,mcdonald2003recommending,meneely2008predicting,meneely2009secure,meneely2010improving,meneely2010use,mikawa2009removing,miranskyy2014effect,mockus2010organizational,nakakoji2005understanding,ngamkajornwiwat2008exploratory,nguyen2008global,omoronyia2009using,orsila2009trust,pinzger2008can,pohl2008dynamic,robertsa2006communication,ryan2010modeling,sarma2009tesseract,schroter2010predicting,sharma2011studying,sharma2017boundary,simpson2013changeset,spinellis2006global,sureka2011using,syeed2013socio,toral2010analysis,urdangarin2008experiences,wolf2008does,wolf2009mining,wolf2009predicting,wu2011drex,yang2014utilizing,zanetti2013rise,zhang2011network,zhang2012automated,zhang2012empirical,zhang2014butter |
| 2-5 | 69 | \citeSaljemabi2018empirical,banitaan2013decoba,bernardi2012developers,bettenburg2013studying,bhattacharya2010fine,bhattacharya2012automated,bhattacharya2014determining,bhowmik2016optimal,biccer2011defect,bird2007open,bird2008latent,bird2009putting,canfora2011social,canfora2012who,cataldo2008communication1,ccaglayan2016effect,chang2007out,chen2010improving,crowston2007self,damian2013role,de2007supporting,dittrich2013network,duc2011impact,ehrlich2006leveraging,el2017periphery,gilbert2007codesaw,gloor2003visualization,hannemann2013community,honsel2015mining,hu2013using,jeong2009improving,jermakovics2011mining,jermakovics2013exploring,kakimoto2006social,kavaler2017stochastic,kerzazi2016can,kerzazi2017knowledge,kidane2007correlating,kumar2019studying,leibzon2016social,lopez2004applying,lopez2008applying,meneely2010strengthening,meneely2011socio,nia2010validity,nzeko2015social,ogawa2007visualizing,panichella2014evolution,sarker2011role,schwind2008svnnat,schwind2008unveiling,singh2010developer,sowe2006identifying,sun2017enhancing,van2010open,wang2013devnet,wang2016analyzing,wang2016diffusion,wiggins2008social,wu2018empirical,xuan2012developer,yang2014social,yu2007mining,yu2008mining,zanetti2013categorizing,zhang2013heterogeneous,zhang2014mining,zhang2014novel,zhang2019file |
| 6-10 | 12 | \citeSbhattacharya2012graph,gharehyazie2013social,gharehyazie2015developer,guilherme2017assessing,huang2005mining,joblin2015developer,joblin2017classifying,ortu2015measuring,palomba2018beyond,panichella2014developers,teixeira2015lessons,zhou2011does |
| 11-100 | 20 | \citeSaljemabi2017empirical,catolino2019gender,crowston2005social,de2007toward,dos2011bringing,geipel2014communication,gharehyazie2017tracing,hinds2006structures,hossain2008measuring,hossain2009social,joblin2017evolutionary,sarker2011path,sowe2014empirical,tamburri2019discovering,weiss2006evolution,xuan2012measuring,xuan2016converging,zanetti2012quantitative,zhang2014generative,zhang2015analyzing |
| >100 | 50 | \citeSallaho2013analyzing,antwerp2010importance,avelino2019measuring,batista2017collaboration,bianchini2015developers,bianchini2016role,bidoki2018network,casalnuovo2015developer,cheng2017developer,cohen2018large,conaldi2010meso,crowston2006core,crowston2006hierarchy,dravzdilova2012method,feczak2009measuring,feczak2011exploring,hahn2006impact,hahn2008emergence,howison2006social,hu2012reputation,hu2018user,huang2011relating,iyer2019requirements,jarczyk2018surgical,jiang2013understanding,lee2013github,liu2007design,long2007social,madey2002open,mergel2015open,ohira2005accelerating,peng2018co,peng2019co,Sapkota2020,singh2010small,singh2011network,surian2010mining,surian2011recommending,surian2013predicting,tan2007social,thung2013network,tymchuk2014collaboration,wan2018scsminer,wang2012human,wang2012survival,wang2019investigating,wu2016effects,xu2005open,xu2005topological,yu2013study |
| Missing | 16 | \citeSbana2018influence,behfar2016intragroup,behfar2018knowledge,bianchini2016social,damian2007collaboration,gao2007network,herbsleb2003empirical,hu2016influence,lima2014coding,maclean2013apache,qiu2019going,van2010importance,wagstrom2005social,xu2004exploration,yu2014exploring,yu2014exploring |
| NA | 12 | \citeSamrit2004social,begel2010codebook,borici2012proxiscientia,de2004technical,de2005seeking,fonseca2006exploring,gao2007towards,gote2019git,ichimura2015analysis,ohira2005supporting,schwind2010tool,valetto2007using |
| #People | #Pubs. | Publications |
|---|---|---|
| 1-10 | 4 | \citeScherry2008social,damian2007collaboration,omoronyia2009using,pohl2008dynamic |
| 11-100 | 32 | \citeSabreu2009developer,ahuja2003individual,avritzer2010coordination,banitaan2013decoba,bird2008latent,cataldo2012impact,chang2007out,crowston2007self,damian2013role,datta2010social,de2007toward,ehrlich2006leveraging,gharehyazie2017tracing,honsel2015mining,huang2005mining,jarczyk2018surgical,jermakovics2011mining,kakimoto2006social,kavaler2017stochastic,lim2011evolving,marczak2008information,meneely2010improving,meneely2010use,mikawa2009removing,palomba2018beyond,panichella2014developers,sarker2011path,sarker2011role,schwind2008unveiling,urdangarin2008experiences,yu2007mining,zhang2012empirical |
| 101-1000 | 64 | \citeSaljemabi2017empirical,bianchini2015developers,bianchini2016role,caglayan2013emergence,canfora2012who,cataldo2006identification,cataldo2008communication,cataldo2008socio,crowston2006hierarchy,datta2011evolution,datta2018does,dittrich2013network,duc2011impact,ducheneaut2005socialization,ehrlich2012all,gharehyazie2013social,gharehyazie2015developer,gloor2003visualization,guilherme2017assessing,herbsleb2003empirical,honsel2015developer,hossain2006actor,huang2011relating,jermakovics2013exploring,joblin2015developer,joblin2017classifying,joblin2017evolutionary,kamei2008analysis,kerzazi2017knowledge,kwan2011does,licorish2014understanding,licorish2015communication,licorish2017exploring,lim2012stakerare,linaaker2019method,lopez2008applying,maclean2011knowledge,meneely2008predicting,meneely2009secure,meneely2010strengthening,meneely2011socio,mockus2010organizational,ngamkajornwiwat2008exploratory,nguyen2008global,orsila2009trust,ortu2015measuring,robertsa2006communication,ryan2010modeling,schwind2008svnnat,sowe2014empirical,spinellis2006global,sun2017enhancing,surian2011recommending,tamburri2019discovering,tymchuk2014collaboration,weiss2006evolution,wolf2008does,wolf2009predicting,wu2011drex,xuan2012measuring,zanetti2012quantitative,zanetti2013categorizing,zhang2011network,zhang2012automated |
| 1001-10000 | 35 | \citeSaljemabi2018empirical,avelino2019measuring,behfar2016intragroup,behfar2018knowledge,bhowmik2016optimal,bidoki2018network,bird2006mining,bird2006mining1,canfora2011social,casalnuovo2015developer,el2017periphery,hannemann2013community,he2012applying,honsel2014software,hu2008comparison,jeong2009improving,kerzazi2016can,kumar2019studying,leibzon2016social,li2016task,lim2010stakenet,liu2007design,nakakoji2005understanding,nia2010validity,nzeko2015social,ogawa2007visualizing,singh2010small,sowe2006identifying,sureka2011using,syeed2013socio,toral2010analysis,xuan2016converging,yu2008mining,zhang2014butter,zhang2019file |
| 10001-100000 | 27 | \citeSantwerp2010importance,batista2017collaboration,bhattacharya2014determining,bird2007open,cohen2018large,dravzdilova2012method,hu2012reputation,kumar2013evolution,long2007social,madey2002open,panichella2014evolution,qiu2019going,Sapkota2020,sarma2009tesseract,sharma2017boundary,surian2010mining,tan2007social,thung2013network,van2010importance,van2010open,wan2018scsminer,wang2013devnet,wu2018empirical,xuan2012developer,zanetti2013rise,zhang2013heterogeneous,zhou2011does |
| >100000 | 15 | \citeSallaho2013analyzing,bernardi2012developers,conaldi2010meso,gao2007network,hahn2008emergence,hong2011understanding,hu2018user,jiang2013understanding,lima2014coding,ohira2005accelerating,wang2019investigating,xu2005topological,yu2013study,yu2014exploring,yu2014exploring |
| Missing | 66 | \citeSbana2018influence,bettenburg2010studying,bettenburg2013studying,bhattacharya2010fine,bhattacharya2012automated,bhattacharya2012graph,bianchini2016social,biccer2011defect,bird2009putting,cataldo2008communication1,catolino2019gender,ccaglayan2016effect,chen2010improving,cheng2017developer,crowston2005social,crowston2006core,damian2007awareness,de2007supporting,dos2011bringing,ell2013identifying,feczak2009measuring,feczak2011exploring,geipel2014communication,gilbert2007codesaw,gonzalez2004community,hahn2006impact,hinds2006structures,hossain2008measuring,hossain2009social,howison2006social,hu2013using,hu2016influence,iyer2019requirements,kidane2007correlating,lee2013github,lopez2004applying,maclean2013apache,mcdonald2003recommending,mergel2015open,miranskyy2014effect,peng2018co,peng2019co,pinzger2008can,schroter2010predicting,sharma2011studying,simpson2013changeset,singh2010developer,singh2011network,surian2013predicting,teixeira2015lessons,wagstrom2005social,wang2012human,wang2012survival,wang2016analyzing,wang2016diffusion,wiggins2008social,wolf2009mining,wu2016effects,xu2004exploration,xu2005open,yang2014social,yang2014utilizing,zhang2014generative,zhang2014mining,zhang2014novel,zhang2015analyzing |
| NA | 12 | \citeSamrit2004social,begel2010codebook,borici2012proxiscientia,de2004technical,de2005seeking,fonseca2006exploring,gao2007towards,gote2019git,ichimura2015analysis,ohira2005supporting,schwind2010tool,valetto2007using |
| Title | Authors | Year | #Cit. |
|---|---|---|---|
| An empirical study of speed and communication in globally distributed software development | James D. Herbsleb, Audris Mockus | 2003 | 1127 |
| Individual Centrality and Performance in Virtual R&D Groups: An Empirical Study | Manju K. Ahuja, Dennis F. Galletta, Kathleen M. Carley | 2003 | 665 |
| Mining email social networks | Christian Bird, Alex Gourley, Premkumar Devanbu, Michael Gertz, Anand Swaminathan | 2006 | 644 |
| The social structure of free and open source software development | Kevin Crowston, James Howison | 2005 | 602 |
| Identification of Coordination Requirements: Implications for the Design of Collaboration and Awareness Tools | Marcelo Cataldo, Patrick A. Wagstrom, James D. Herbsleb, Kathleen M. Carley | 2006 | 465 |
| Socialization in an Open Source Software Community: A Socio-Technical Analysis | Nicolas Ducheneaut | 2005 | 459 |
| Improving Bug Triage with Bug Tossing Graphs | Gaeul Jeong, Sunghun Kim, Thomas Zimmermann | 2009 | 434 |
| The Open Source Software Development Phenomenon: An Analysis Based on Social Network Theory | Gregory Madey, Vincent Freeh, Renee Tynan | 2002 | 342 |
| The role of communication and trust in global virtual teams: A social network perspective | Saonee Sarker, Manju K. Ahuja, Suprateek Sarker, Sarah Kirkeby | 2011 | 313 |
| Latent social structure in open source projects | Christian Bird, David Pattison, Raissa D’Souza, Vladimir Filkov, Premkumar Devanbu | 2008 | 301 |
| Socio-Technical Congruence: A Framework for Assessing the Impact of Technical and Work Dependencies on Software Development Productivity | Marcelo Cataldo, James D. Herbsleb, Kathleen M. Carley | 2008 | 300 |
| Emergence of New Project Teams from Open Source Software Developer Networks: Impact of Prior Collaboration Ties | Jungpil Hahn, Jae Y. Moon, Chen Zhang | 2008 | 296 |
| Can developer-module networks predict failures? | Martin Pinzger, Nachiappan Nagappan, Brendan Murphy | 2008 | 243 |
| Predicting failures with developer networks and social network analysis | Andrew Meneely, Laurie Williams, Will Snipes, Jason Osborne | 2008 | 241 |
| Self-organization of teams for free/libre open source software development | Kevin Crowston, Qing Li, Kangning Wei, U. Y. Eseryel, James Howison | 2007 | 240 |
| Recommending collaboration with social networks: A comparative evaluation | David W. McDonald | 2003 | 226 |
| Codebook: discovering and exploiting relationships in software repositories | Andrew Begel, Yit P. Khoo, Thomas Zimmermann | 2010 | 222 |
| Predicting build failures using social network analysis on developer communication | Timo Wolf, Adrian Schröter, Daniela Damian, Thanh H.D. Nguyen | 2009 | 217 |
| Awareness in the Wild: Why Communication Breakdowns Occur | Daniela Damian, Luis Izquierdo, Janice Singer, Irwin Kwan | 2007 | 214 |
| Structures that work: social structure, work structure and coordination ease in geographically distributed teams | Pamela Hinds, Cathleen McGrath | 2006 | 209 |
| Applying social network analysis to the information in CVS repositories | Luis Lopez-Fernandez, Gregorio Robles, Jesus M. Gonzales-Barahona | 2004 | 207 |
| Core and Periphery in Free/Libre and Open Source Software Team Communications | Kevin Crowston, Kangning Wei, Qing Li, James Howison | 2006 | 204 |
| Hierarchy and centralization in free and open source software team communications | Kevin Crowston, James Howison | 2006 | 200 |
| A Topological Analysis of the Open Source Software Development Community | Jin Xu, Yongqin Gao, Scott Christley, Gregory Madey | 2005 | 198 |
| Putting It All Together: Using Socio-technical Networks to Predict Failures | Christian Bird, Nachiappan Nagappan, Harald Gall, Brendan Murphy, Premkumar Devanbu | 2009 | 197 |
| Author | #Cit. | #Pubs. | #Influential Pubs. |
|---|---|---|---|
| James D. Herbsleb | 2271 | 7 | 3 |
| Kathleen M. Carley | 1488 | 4 | 3 |
| Premkumar Devanbu | 1477 | 10 | 3 |
| Christian Bird | 1472 | 8 | 3 |
| James Howison | 1397 | 6 | 4 |
| Kevin Crowston | 1397 | 6 | 4 |
| Daniela Damian | 1026 | 11 | 2 |
| Gregory Madey | 704 | 9 | 2 |
| Vladimir Filkov | 498 | 9 | 1 |
| Venue | #Pubs. |
|---|---|
| International Conference on Software Engineering (ICSE) | 17 |
| International Conference on Open Source Software (OSS) | 16 |
| International Conference on Mining Software Repositories (MSR) (Workshop until 2007, Working Conference until 2015) | 15 |
| Conference on Computer Supported Cooperative Work (CSCW) | 10 |
| International Conference on the Foundations of Software Engineering (FSE) | 8 |
| Hawaii International Conference on System Sciences (HICSS) | 8 |
| Asia-Pacific Software Engineering Conference (APSEC) | 8 |
| Empirical Software Engineering, Springer | 7 |
| International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE) | 7 |
| Information and Software Technology, Elsevier | 6 |
| Journal of Systems and Software (JSS) | 5 |
| International Conference on Global Software Engineering (ICGSE) | 5 |
| International Conference on Software Maintenance and Evolution (ICSME) (ICSM until 2013) | 5 |
A systematic mapping study of developer social network research
Steffen Herbold
Aynur Amirfallah
Fabian Trautsch
Jens Grabowski
Institute of Computer Science, University of Goettingen, Germany
Abstract
Developer social networks (DSNs) are a tool for the analysis of community structures and collaborations between developers in software projects and software ecosystems. Within this paper, we present the results of a systematic mapping study on the use of DSNs in software engineering research. We identified 255 primary studies on DSNs. We mapped the primary studies to research directions, collected information about the data sources and the size of the studies, and conducted a bibliometric assessment. We found that nearly half of the research investigates the structure of developer communities. Other frequent topics are prediction systems built using DSNs, collaboration behavior between developers, and the roles of developers. Moreover, we determined that many publications use a small sample size regarding the number of projects, which could be problematic for the external validity of the research. Our study uncovered several open issues in the state of the art, e.g., studying inter-company collaborations, using multiple information sources for DSN research, as well as a general lack of reporting guidelines and replication studies.
Keywords: developer social networks; mapping study; literature survey
1 Introduction
Social structures within software development projects are a topic that has received a lot of attention in different research communities, e.g., from researchers interested in open source development, global software engineering, and mining software repositories. Developer Social Networks (DSNs) are often inferred automatically from information that can be found in forges like GitHub, Mailing Lists (ML), Issue Tracking Systems (ITS), and Version Control Systems (VCS) of software development projects. DSNs give valuable insights into the projects, e.g., regarding the importance of individuals \citeSjoblin2015developer, patterns in communication behavior \citeSdamian2007collaboration, single points of failure \citeStamburri2019discovering, gender aspects \citeScatolino2019gender, and even bugs \citeSpinzger2008can. Due to the large number of publications on DSNs, the diversity of topics they address, and the lack of a contemporary literature review, a new literature study is required so that researchers and practitioners can get a complete overview of the state of the art of DSNs. This article describes a mapping study that follows the rigorous guidelines for literature reviews by Kitchenham and Charters [1], with the goal of identifying and mapping research on DSNs. We map the publications on DSNs to research topics and analyze the scope of the publications in terms of data sources, number of projects, and number of people.
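To make the idea of inferring a network from repository data concrete, the following minimal sketch builds a small DSN from version control records. The linking rule (two developers are connected if they changed the same file) is only one possible construction and is an assumption made for illustration; the commit records and the names `build_dsn` and `degree` are hypothetical and not taken from any of the primary studies.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical commit records: (developer, changed file).
commits = [
    ("alice", "core/parser.py"),
    ("bob", "core/parser.py"),
    ("bob", "docs/index.md"),
    ("carol", "docs/index.md"),
    ("alice", "core/lexer.py"),
]


def build_dsn(records):
    """Return weighted undirected edges between developers.

    Assumption: two developers are linked once for every file
    that both of them have changed.
    """
    developers_by_file = defaultdict(set)
    for developer, path in records:
        developers_by_file[path].add(developer)

    edges = defaultdict(int)
    for developers in developers_by_file.values():
        # Sort so that each undirected edge has one canonical key.
        for a, b in combinations(sorted(developers), 2):
            edges[(a, b)] += 1
    return dict(edges)


def degree(edges):
    """Count the number of distinct neighbours of each developer."""
    counts = defaultdict(int)
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return dict(counts)


dsn = build_dsn(commits)
print(dsn)
print(degree(dsn))
```

In this toy network, the degree of a developer is a very simple proxy for how central that person is; the studies surveyed here use a wide range of richer network measures and data sources.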
With our mapping study, we provide the following contributions.
1. A contemporary overview of the state of the art of the literature on DSNs.
2. A summary of the already investigated research directions, including the relevant literature.
3. A summary of the data sources, as well as the size of the DSNs in terms of number of projects and people involved.
4. A bibliometric assessment to identify influential publications, authors, venues, and interest in the topic over time.
5. The identification of open issues within the current state of the art.
We found that 49% of all publications on DSNs analyze the structure of the community, either in general, or with respect to other aspects of software development, e.g., the evolution, or the impact on code quality. Other frequent topics in research are prediction systems based on DSNs, e.g., for defect prediction or bug triage, the collaboration behavior between developers, and the roles of developers. Regarding the way that studies are conducted, we found that 79% of the studies are based on a single data source and 70% of the studies use fewer than 11 projects to draw conclusions. These are concerning findings regarding the generalizability of results. Regardless, 80% of publications use social networks with at least 100 people modelled by the network, i.e., large networks are usually the foundation for analysis, which is good for the generalizability. Thus, we believe there is a need for studies with high external validity on DSNs, especially more studies that consider a large number of different projects in order to derive generalizable conclusions for diverse populations. Other open issues in the state of the art are, e.g., inter-company collaborations and the use of data from multiple information sources for the analysis of DSNs. Finally, the extraction of data from the publications for this mapping study revealed a lack of reporting guidelines for DSNs, i.e., some publications fail to report basic meta data about the studies conducted, e.g., the number of projects considered, the number of developers involved, or how data was processed, e.g., to deal with duplicate identities.
The remainder of this paper is organized as follows. We give a definition of DSNs in Section 2. In Section 3, we present our methodology for the mapping study, including our research questions, inclusion and exclusion criteria for the literature, how we identified publications, and the data we collected for each included publication. In Section 4, we give the results of our review by listing the primary studies we found and mapping them to DSN concepts according to our research questions. Afterwards, we discuss open issues regarding DSN research based on the results of our mapping study, discuss related prior literature studies, and conclude the article.
2 Definition of Developer Social Networks (DSNs)
A definition is difficult, because different data sources, research goals, and modelling approaches are used to represent DSNs in the literature. Due to this, publications on DSNs contain the specific definition of their DSN structure, but this varies between publications. For our purpose, we require a definition that can be applied to check whether a construct is an instance of a DSN. We identified three necessary and sufficient conditions for DSNs.
1. A DSN is described by a graph $G = (V, E)$, where $V$ denotes a set of vertices and $E$ a set of edges such that $E \subseteq V \times V$. The graph can be directed or undirected, depending on the intent of the researchers and the data that is used for modelling the DSN.
2. The vertices or a subset of the vertices must represent actors of a software development process, e.g., developers, users, or project managers.
3. The edges represent connections between vertices that are based on communication behavior (e.g., email communication) or collaboration behavior (e.g., contributions to the same software artifact).
An example of a DSN is given in Figure 1. This figure depicts an anonymized excerpt of the DSN created by Bird et al. \citeSbird2006mining. The vertices in this graph represent different developers who were active on Apache email lists. A directed edge between two vertices exists if the developer has sent or replied to at least 150 emails of another developer.
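To make the three conditions concrete, the following minimal sketch builds a directed DSN from email data. It is an illustration under stated assumptions, not the procedure of Bird et al.: the function name `build_email_dsn`, the representation of each email as a `(sender, recipient)` pair, and the toy data are hypothetical. Only the idea of keeping an edge once a message-count threshold is reached is taken from the example above.

```python
from collections import Counter


def build_email_dsn(messages, min_messages=150):
    """Build a directed DSN from (sender, recipient) email pairs.

    Hypothetical helper: an edge (u, v) is kept if u sent or replied to
    at least `min_messages` emails of v. Self-addressed mail is ignored.
    """
    counts = Counter((s, r) for s, r in messages if s != r)
    # Condition 2: every vertex is an actor that took part in the communication.
    vertices = {person for pair in counts for person in pair}
    # Conditions 1 and 3: edges are a subset of V x V based on communication.
    edges = {pair for pair, n in counts.items() if n >= min_messages}
    return vertices, edges


# Toy data (invented); the threshold is lowered to 2 so the example stays small.
messages = (
    [("alice", "bob")] * 3
    + [("bob", "alice")] * 1
    + [("carol", "alice")] * 2
)
vertices, edges = build_email_dsn(messages, min_messages=2)
print(sorted(vertices))
print(sorted(edges))
```

In this sketch, vertices without any edge above the threshold remain in the vertex set; whether such isolated actors should be dropped is a modelling decision that differs between studies.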
3 Methodology
Our review follows the guidelines for systematic literature reviews proposed by Kitchenham and Charters [1]. Additionally, we used backward and forward snowballing, which was suggested for systematic literature studies by Wohlin [2]. In the following, we define our underlying research questions, inclusion and exclusion criteria, how we identified papers, and which data was collected for our study. We do not define our study as systematic literature review but as a systematic mapping study, because we did not perform any synthesis of the results, but only provide an overview of the literature.
3.1 Research Questions
In order to study the state of the art in DSNs, we defined the following five research questions to guide our mapping study.
RQ1. What software engineering topics have been addressed by DSNs?
RQ2. Which data sources are used for modelling of DSNs?
RQ3. What is the scope of the analysis…
  a) with respect to number of projects considered
  b) and people modelled by the DSNs?
RQ4. What are the most influential…
  a) publications?
  b) authors?
  c) venues?
RQ5. How did the interest in DSN research evolve over time?
The first three research questions guide our analysis of the state of the art of DSNs. We want to get insights into both the topics that are under investigation within the research community, as well as the number of studies on different topics through our analysis for RQ1. The research questions RQ2 and RQ3 guide our investigation of the scope of studies. Through the answer to RQ2, we want to get valuable information about the data sources that researchers use to define social relationships. Through RQ3, we want to gain insights into how large the studies are, e.g., if they are case studies of specific cases with few projects or if they are broad studies over hundreds of projects. The fourth and fifth questions give us insights into the community of DSN research itself. RQ4 will tell us which work had the most impact, i.e., early foundational work and later work that presented new ideas for the use of DSNs that influenced many other publications. Moreover, we assess if there are authors who are clearly distinguished in the field of DSN research through their publications. We also look at the venues where DSN research is most often published to gain insights into which communities frequently use DSNs in their research. Through RQ5 we want to understand how the interest in DSN research evolves over time, e.g., if the interest is still growing or if the topics of interest change over time.
3.2 Inclusion and Exclusion Criteria
To identify which papers should be part of our review, we defined the following criteria for inclusion:
1. publications that describe DSNs;
2. publications that describe how DSNs may be created; and
3. publications that describe theoretical aspects of DSNs.
Additionally, we used the following exclusion criteria:
1. publications that only summarize existing work without new contributions;
2. publications that only consider social networks or graph structures in general, without a direct and specific relation to software development;
3. publications that were not peer-reviewed; and
4. publications that are not published in English.
3.3 Identification of Primary Studies
Figure 2 summarizes our workflow for the identification of primary studies. We used a five-step procedure.
1. Initial scan of the literature using search engines and prior literature studies to identify a seed of publications.
2. Backward and forward snowballing of publications found in the initial scan.
3. Second scan of the literature using search engines to capture the remainder of 2017 and to account for delayed indexing of publications.
4. Backward and forward snowballing of publications found in the second scan.
5. Final check of inclusion and exclusion criteria on all identified publications.
In the first step, we searched for publications by using five search engines: Google Scholar, IEEE Xplore, ACM Digital Library, Springer Link, and Elsevier Search. A sixth search engine, Scopus, was only used for the additional search in the third step and not for the initial search. We used three queries for each search engine: “developer social networks”, “developer network”, and “collaborative networks OSS”. Table 1 gives an overview of the number of hits we had with our search terms in each of the search engines. This initial search was conducted between May 2017 and September 2017. Due to the extremely high number of hits, we considered only 750 hits per search engine and search term to get the literature seed for our mapping study. Next, we selected candidates for inclusion by reading the titles, abstracts, and, if it was necessary, the introduction and conclusion sections of the publications. We identified 145 publications through this procedure from the search engines. Additionally, we scanned the primary studies from prior related literature studies by Zhang et al. [3], Tamburri et al. [4], Manteli et al. [5], and Abufouda and Abukwaik [6] (see the discussion of related work). We identified 39 additional publications from the prior studies. This difference is mainly due to the scope of the other literature studies, especially with respect to search terms. For example, Manteli et al. [5] focus on global software engineering and, therefore, also use search terms that do not mention DSNs. Thus, we identified 184 publications in this first step.
In the second step, we checked the related work cited in each of the publications we found using the search engines. This step is also known as backward snowballing [2]. Moreover, we used the “cited by” function of Google Scholar to identify publications that cited the publications we identified with the search engines. This step is also known as forward snowballing [2]. We also applied the snowballing to each additional publication we found. We identified 32 additional publications, i.e., 216 publications in total. The snowballing also served to mitigate potential negative effects of not considering every hit for the search terms with the search engines. Our assumption is that we find the literature we may have missed through the snowballing. Moreover, like the use of the prior literature reviews as seed, the snowballing allowed us to identify literature that did not mention DSNs in the paper title or abstract and was, therefore, missed by our search.
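Backward and forward snowballing, applied repeatedly to every newly found publication, can be read as a traversal of the citation graph. The sketch below is only an illustration of that reading: the function `snowball`, the `references` mapping, and the `is_relevant` predicate (standing in for the manual check against the inclusion and exclusion criteria) are hypothetical and do not describe tooling used in the study.

```python
from collections import deque


def snowball(seed, references, is_relevant):
    """Collect publications reachable from `seed` via citations.

    references: dict mapping a paper to the set of papers it cites.
    is_relevant: callable deciding whether a candidate is included
                 (a stand-in for the manual screening step).
    """
    # Invert the citation map so "cited by" lookups are possible.
    cited_by = {}
    for paper, refs in references.items():
        for ref in refs:
            cited_by.setdefault(ref, set()).add(paper)

    included = set(seed)
    queue = deque(seed)
    while queue:
        paper = queue.popleft()
        backward = references.get(paper, set())  # papers this one cites
        forward = cited_by.get(paper, set())     # papers citing this one
        for candidate in backward | forward:
            if candidate not in included and is_relevant(candidate):
                included.add(candidate)
                queue.append(candidate)  # snowball from new finds as well
    return included


# Toy citation graph (invented): A cites B and X, C cites A, D cites C.
references = {"A": {"B", "X"}, "C": {"A"}, "D": {"C"}, "B": set()}
found = snowball(["A"], references, is_relevant=lambda p: p != "X")
print(sorted(found))
```

Publications rejected by the screening step are not expanded further, which mirrors the idea that only included publications serve as starting points for additional snowballing.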
In the third step, we repeated our search for literature from the first step. This was required because the initial search already started in May 2017, i.e., we could not be confident that all papers from 2016 were indexed by the search engines and part of the data for 2017 was not available yet. Moreover, we wanted to include recent publications that would be missing otherwise. Thus, we repeated the search with Google Scholar in July 2018 and July 2019 and with Scopus in February 2020. This way, we identified 31 new publications using Google Scholar and 29 publications using Scopus, bringing our total number of publications to 276. Afterwards, in the fourth step, we performed an additional round of snowballing on these publications and identified 20 additional publications, i.e., a total of 296 publications.
Before we started with the data collection, we validated in our last step whether all identified candidates met the inclusion criteria and did not violate the exclusion criteria. This way, we excluded 41 of the identified publications, mainly because they were not peer reviewed (e.g., book chapters, preprints on arXiv), summarized only existing work (e.g., surveys, dissertation summaries), or because they did not contain anything specific to developer social networks, regardless of our initial assessment. This left us with 255 primary studies.
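The counts reported for the five steps can be checked with a short running tally. The numbers below are taken from the text of this section; the step labels are our own shorthand.

```python
# Publications added (or removed) per step, as reported in this section.
steps = [
    ("initial scan, search engines", 145),
    ("initial scan, prior literature studies", 39),
    ("first snowballing", 32),
    ("second scan, Google Scholar", 31),
    ("second scan, Scopus", 29),
    ("second snowballing", 20),
    ("final check, excluded", -41),
]

total = 0
running = []
for label, delta in steps:
    total += delta
    running.append(total)
    print(f"{label:40s} {delta:+4d} -> {total}")
```

The intermediate totals reproduce the figures stated above (184, 216, 276, 296) and end at the 255 primary studies.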
3.4 Data Collection
Once all literature was identified, we proceeded with the collection of the data required to answer our research questions. For RQ1, we first extracted the research questions and/or hypotheses that were formulated to guide the research, as well as the contributions as listed in the introduction or summarized in the abstract from the publications. We used inductive coding [7] performed by two researchers to identify the research topics of the papers from the hypotheses and contributions in order to obtain the necessary information to answer RQ1. For this, we printed the title, research questions/hypotheses, and contributions of each publication on a separate sheet of paper and sorted them incrementally by their topic, starting with a coarse-grained separation until we were satisfied that our categories provided a sufficient amount of detail for our mapping study. For RQ2 and RQ3, we extracted the data source, the number of projects, and the number of participants in the DSN used within the publications. For RQ4 and RQ5, we collected meta data about the publications themselves, i.e., the title, authors, publication venue, year, and number of citations. We organized the collected data in a spreadsheet which is made available as supplementary material.
4 Literature Review
In this section, we provide the review of the state of the art of DSN research based on the data collection we described in Section 3. We systematically address different topics. We use the data from this review to answer our research questions.
4.1 Research Directions
Based on the description of the contributions, the research questions, and the research hypotheses of publications, we identified seven general research directions regarding DSNs. For four of the general research directions we identified subtopics, i.e., specific aspects that were considered within the general direction. Table 2 shows our mapping of publications to the research directions including subtopics.
Nearly half of the publications we identified analyze the community structures in software development projects. Most of these publications analyzed the general structure of the DSN. However, we also identified seven more specific subtopics of the analysis of community structures: the evolution of the communities by considering DSNs over time; community structures in the context of global software engineering; the formation of teams within development projects; the correlation between the community structure and code quality; the analysis of socio-technical congruence; the simulation of community structures; and the identification of community smells.
DSNs are frequently used for the creation or improvement of prediction models for various aspects in software development projects. We identified seven subtopics of prediction approaches using DSNs: bug triage, i.e., support for assigning appropriate developers to work on bug reports; defect prediction, i.e., using the social structure of a project to enhance models that estimate the defect-proneness of different parts of software; recommendation of suitable developers for project work in general; predictions of the outcome of a project, i.e., if projects are likely successful; predictions of suitable Web services; predictions of build failures; and prediction of appropriate developers for code review.
The collaboration behavior was also scrutinized using DSNs. While DSNs are modelling some direct or indirect collaboration behavior in software development projects, the analysis of the collaboration behavior itself is in general not the focus. The publications we identified for this research direction focus directly on the collaboration behavior, e.g., which tools were used or how collaboration behavior was impacted by the structure of projects. In addition to research on collaboration behavior in general, we identified three more specific subtopics: collaboration behavior in global software engineering; problems in collaboration behavior and how they are reflected in DSNs; and collaboration between developers from different companies, including competitors in open source projects.
DSNs are also frequently used to assess the roles of developers within a development project, e.g., whether a developer is a core developer or a peripheral developer. While the identification of roles for developers in general is the main topic of this research direction, we also identified two other subtopics: the analysis of how onboarding of peripheral developers within projects works, and how developers specialize within a project.
We also identified research regarding tools for DSN analysis, mostly for the visualization of DSNs based on different information sources.
The validity of DSN research was also considered by five publications. These publications do not question the validity of DSN research in general, but rather analyze how properties of DSN research may depend on the specific context of research projects, e.g., the scope of the analysis or the repository that was used as source for the DSNs.
Finally, we found one publication on a data set that directly contains the graph structure of a DSN. The lack of publications on data sets shows that researchers either generate DSNs from data they collect, or from more general data sets that do not model DSNs directly. Such data sets contain general information mined from software repositories from which a DSN is then built.
Answer to RQ 1: Community structures are the dominant research direction. Other frequently studied directions are DSNs for predictions, collaboration behavior and developer roles. Tools, studies on validity, and data sets play only a minor role.
4.2 Data Sources
There are five major data sources which are used by 241 of the 255 publications:
1. Forges like GitHub or SourceForge that are used by millions of developers for hosting and developing open source software. These forges offer an integration of VCSs and ITSs within a single environment, often coupled with other services like Web pages, hosting of releases, or Wikis. Thus, they are a rich source for collaborations between developers, both within a project, as well as across multiple projects.
2. ITSs like Jira or Bugzilla are used for the collection, tracking, and management of issues and work items within projects, e.g., change requests, bug reports, or questions by users. ITSs allow the discussion about issues, the definition of work flows for issues, and different types of resolutions.
3. VCSs like Git or SVN are systems that track and archive changes of files and folders over time. Typically, VCSs allow different development branches and support working collaboratively on the same resources [8].
4. MLs are collections of email addresses that can be used for communication within software projects. MLs may be restricted, e.g., not everybody may be allowed to post or subscribe to a ML. Participants of MLs may be natural persons (e.g., developers, users), but also systems (e.g., continuous integration systems, ITSs).
5. Surveys, i.e., interviews or questionnaires that were used to directly ask developers about their communication behaviour within a development project.
In addition to the five major sources, there are other ways that researchers used to collect information about collaboration behavior which we summarized as “Other” in Table 3. These are IRC chats \citeScataldo2008communication1,cataldo2008communication,panichella2014developers,wang2016diffusion, plug-ins that monitor development environments \citeSomoronyia2009using,borici2012proxiscientia,de2007supporting, manual inspection of project documents, e.g., requirements \citeSdamian2013role,damian2007collaboration,marczak2008information, owners of web service mash-ups \citeSbianchini2016role, bianchini2016social, bianchini2015developers, the web site Ohloh that provides statistics about open source development (the name has changed to https://www.openhub.net/) \citeShu2008comparison,hu2012reputation, online discussion forums \citeSwiggins2008social,crowston2007self, JAR files \citeShu2013using, the BlogLinks and Advogato social networks of software developers (both are not available online anymore) \citeSwagstrom2005social, on-site researchers that observe communication behavior \citeSdamian2007awareness, employee directories \citeSbegel2010codebook, and the code review portal Gerrit \citeSyang2014social. Additionally, one publication discusses DSNs from an abstract perspective and proposes the use of tracking for every communication including phone calls, emails, etc. \citeSamrit2004social.
Figure 3 depicts the number of data sources that were used for modelling DSNs. It highlights that 204 of the 255 publications build a DSN that is based on a single source, 43 publications used a combination of two data sources, six publications used three data sources, and two publications used four data sources.
Answer to RQ 2: Software repositories like forges, ITSs, VCSs and MLs are the main sources for DSNs; however, surveys are also sometimes used. Publications commonly use a single source for DSN modelling. The knowledge about DSNs built with multiple sources is limited.
4.3 Number of Projects Analyzed
A major factor regarding the external validity of results is the number of projects for which data is collected. If only data about very few projects is used for an empirical study about a phenomenon that can be studied using DSNs, the results may not generalize to other projects. The likelihood that the results generalize to software engineering in general increases with the number of projects that are analyzed. Table 4 shows the number of projects per publication. The data we collected shows that most papers on DSNs perform some sort of empirical study to demonstrate their approach or research a phenomenon. Only 12 of the 255 publications we identified did not perform any empirical study. Moreover, we identified 16 publications for which we could not identify the number of projects from the publication. There were two reasons for this: either the authors did not report how they selected a smaller subset from a larger database or the authors did not specify which projects were used at all. This is not only problematic for evaluating the external validity of a study, but also hinders replications of the results. Of the 227 publications for which we could identify the number of projects, 76 used only a single project for their empirical study and 69 used only 2-5 projects. In other words, about 33% of the publications on DSNs used a single project, another 30% used 2-5 projects. Both numbers are extremely low and do not allow for a generalization of the findings due to the limited context covered by the projects. Another 12 publications only considered 6-10 projects, which is still a small number. On the bright side, 50 publications used more than 100 projects, i.e., larger sample sizes that usually allow findings to be generalized. 38 of these publications use a forge as data source.
Regardless, our analysis of the sample sizes with respect to the number of projects indicates a severe threat to the external validity of many empirical studies on DSNs.
Answer to RQ 3a: Over 69% of all publications use fewer than 11 projects to evaluate their findings. Most publications with at least 100 projects use a forge as data source (38 of 50).
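The shares quoted for the number of projects follow directly from the reported counts. The short calculation below recomputes them relative to the 227 publications for which the number of projects could be identified; the variable names are our own.

```python
# Counts reported in Section 4.3.
with_known_project_count = 227
by_projects = {"1": 76, "2-5": 69, "6-10": 12}
more_than_100_projects = 50
forge_based_among_those = 38

shares = {
    bucket: round(100 * n / with_known_project_count, 1)
    for bucket, n in by_projects.items()
}
fewer_than_11 = sum(by_projects.values())
share_fewer_than_11 = round(100 * fewer_than_11 / with_known_project_count, 1)
share_forge = round(100 * forge_based_among_those / more_than_100_projects, 1)

print(shares)
print(fewer_than_11, share_fewer_than_11)
print(share_forge)
```

This yields roughly 33.5% for a single project, 30.4% for 2-5 projects, and 69.2% for fewer than 11 projects, in line with the rounded values in the text.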
4.4 Number of Developers in the DSNs
The second major factor regarding the validity of results is the number of people that are part of the DSNs. Table 5 shows the data we collected regarding the number of people in the DSNs. In case a publication created multiple DSNs, e.g., one per project considered, we report the mean value of the people in the DSNs. The number of people modelled by the DSNs is relatively high. 77 publications have more than 1,000 people as part of their DSNs, 15 publications actually model more than 100,000 people. Only four publications have very small networks with at most 10 people, another 32 publications consider at most 100 people. Thus, for the publications for which the data about the number of people is available, the networks that are considered are in general relatively large. When we looked closely at the data, we observed two reasons for this. First, while many publications consider only a few projects, these projects tend to be very large, e.g., Mozilla Firefox and the Eclipse IDE. Second, our data also shows that MLs and forges are the most common data sources for DSNs. Both capture not only developers, but also users of the respective projects. We also found a very concerning general trend in the literature: 66 of the 240 publications that performed an empirical study did not report the number of participants in the DSN. This is a vital piece of information for the estimation of both the internal and external validity of empirical studies that should always be reported.
Answer to RQ 3b: Most publications report networks that have more than 100 vertices. The number of developers is often much larger than the number of projects, because large-scale projects with big communities are analyzed.
4.5 Influential Publications
We collected data regarding the citation counts from Google Scholar. We take the pattern from the ACM Distinguished Paper awards to define our criterion for influential publications, and consider the top 10% with the most citations as influential. Since we have 255 publications, this means we consider the 25 publications with the most citations (see the table of most-cited publications). We note that the citations for the third most cited paper \citeSbird2006mining also include the citations for the paper \citeSbird2006mining1, because the two publications are considered as the same paper by Google Scholar. The 25 most influential publications address
1. software development with globally distributed project members \citeSherbsleb2003empirical,ahuja2003individual,hinds2006structures;
2. community structures in software development projects \citeSbird2006mining,crowston2005social,ducheneaut2005socialization,madey2002open,bird2008latent,crowston2006hierarchy,lopez2004applying,xu2005topological;
3. the formation of teams in projects through collaboration \citeShahn2008emergence,crowston2007self;
4. the identification of relationships between developers \citeSbegel2010codebook;
5. the impact of coordination requirements between developers on tool design \citeScataldo2006identification and modularization \citeScataldo2008socio;
6. communication issues \citeSdamian2007awareness and trust \citeSsarker2011role;
7. the identification of core developers \citeScrowston2006core;
8. predictions to support software engineering processes, i.e., bug triage \citeSjeong2009improving, defect prediction \citeSpinzger2008can,meneely2008predicting,bird2009putting, build failure prediction \citeSwolf2009predicting, and collaborations \citeSmcdonald2003recommending.
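The top-10% criterion for influential publications can be sketched as a simple ranking by citation count. The helper `top_cited` and the toy citation counts are hypothetical. Rounding the cut-off down is an assumption inferred from the text, where 10% of 255 publications is treated as 25.

```python
def top_cited(citations, percent=10):
    """Return the top `percent` of publications by citation count.

    citations: dict mapping a publication key to its citation count.
    The cut-off is rounded down (assumption: 10% of 255 gives 25).
    """
    k = len(citations) * percent // 100
    ranked = sorted(citations.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Toy data (invented): 20 publications with 0..19 citations.
citations = {f"p{i}": i for i in range(20)}
influential = top_cited(citations)
print(influential)

# Cut-off for the 255 primary studies of this mapping study.
cutoff = 255 * 10 // 100
print(cutoff)
```

Ties at the cut-off are broken by input order in this sketch, since Python's sort is stable; the text does not state how ties were handled.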
- [1] B. Kitchenham, S. Charters, Guidelines for Performing Systematic Literature Reviews in Software Engineering (Version 2.3), Technical Report EBSE-2007-01, Keele Univ., EBSE (2007).
- [2] C. Wohlin, Guidelines for snowballing in systematic literature studies and a replication in software engineering, in: Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering, EASE ’14, ACM, New York, NY, USA, 2014, pp. 38:1–38:10. doi:10.1145/2601248.2601268.
- [3] W. Zhang, L. Nie, H. Jiang, Z. Chen, J. Liu, Developer social networks in software engineering: construction, analysis, and applications, Science China Information Sciences 57 (12) (2014) 1–23.
- [4] D. A. Tamburri, P. Lago, H. v. Vliet, Organizational social structures for software engineering, ACM Computing Surveys (CSUR) 46 (1) (2013) 3.
- [5] C. Manteli, H. Van Vliet, B. Van Den Hooff, Adopting a social network perspective in global software development, in: Global Software Engineering (ICGSE), 2012 IEEE Seventh International Conference on, IEEE, 2012, pp. 124–133.
- [6] M. Abufouda, H. Abukwaik, On using network science in mining developers collaboration in software engineering: A systematic literature review, International Journal of Data Mining & Knowledge Management Process 7 (5/6) (2017) 1–20. doi:10.5121/ijdkp.2017.7601.
- [7] D. R. Thomas, A general inductive approach for analyzing qualitative evaluation data, American Journal of Evaluation 27 (2) (2006) 237–246. doi:10.1177/1098214005283748.
- [8] I. Sommerville, et al., Software engineering, Boston: Pearson, 2011.
