A Cross-Repository Model for Predicting Popularity in GitHub

Neda Hajiakhoond Bidoki; Gita Sukthankar; Heather Keathley; Ivan Garibay

arXiv:1902.05216·cs.SI·February 15, 2019

TL;DR

This paper introduces a cross-repository LSTM model that predicts the popularity of GitHub repositories by leveraging events across multiple repositories, outperforming traditional single-repository models.

Contribution

The paper presents a novel LSTM-based approach that incorporates cross-repository data for more accurate popularity prediction in social coding platforms.

Findings

1. LSTM model outperforms ARIMA in popularity prediction
2. Cross-repository information improves forecasting accuracy
3. Model captures influence of one repository's events on others

Abstract

Social coding platforms, such as GitHub, can serve as natural laboratories for studying the diffusion of innovation through tracking the pattern of code adoption by programmers. This paper focuses on the problem of predicting the popularity of software repositories over time; our aim is to forecast the time series of popularity-related events (code forks and watches). In particular, we are interested in cross-repository patterns: how do events on one repository affect other repositories? Our proposed LSTM (Long Short-Term Memory) recurrent neural network integrates events across multiple active repositories, outperforming a standard ARIMA (Auto-Regressive Integrated Moving Average) time series prediction based on a single repository. The ability of the LSTM to leverage cross-repository information gives it a significant edge over standard time series forecasting.

Tables

Table I: Dataset details and hyperparameters

Parameter	Value
Number of repositories	100
Time-step length	10 days
Sequence length	2/3/4/5/6
Number of features	100
Number of hidden layers	2
Number of nodes in each hidden layer	1/2/3
Loop back	8

Table II: Total average RMSE for LSTM and ARIMA. LSTM yields lower error on the test data.

Model	Average RMSE
LSTM	312.07
ARIMA	401.35

Equations

$$RMSE_{r}=\sqrt{\frac{1}{T}\sum_{t=1}^{T}\big(Y_{r,t}-\widehat{Y}_{r,t}\big)^{2}}$$

$$RMSE_{t}=\sqrt{\frac{1}{R}\sum_{r=1}^{R}\big(Y_{r,t}-\widehat{Y}_{r,t}\big)^{2}}$$

$$\text{Total average RMSE}=\sqrt{\frac{1}{R\cdot T}\sum_{t=1}^{T}\sum_{r=1}^{R}\big(Y_{r,t}-\widehat{Y}_{r,t}\big)^{2}}$$

Full text

A Cross-Repository Model for Predicting Popularity in GitHub

Neda Hajiakhoond Bidoki (1), Gita Sukthankar (1), Heather Keathley (2) and Ivan Garibay (2)

(1) Department of Computer Science

(2) Industrial Engineering & Management Systems

Abstract

Social coding platforms, such as GitHub, can serve as natural laboratories for studying the diffusion of innovation through tracking the pattern of code adoption by programmers. This paper focuses on the problem of predicting the popularity of software repositories over time; our aim is to forecast the time series of popularity-related events (code forks and watches). In particular, we are interested in cross-repository patterns: how do events on one repository affect other repositories? Our proposed LSTM (Long Short-Term Memory) recurrent neural network integrates events across multiple active repositories, outperforming a standard ARIMA (Auto-Regressive Integrated Moving Average) time series prediction based on a single repository. The ability of the LSTM to leverage cross-repository information gives it a significant edge over standard time series forecasting.

Index Terms:

LSTM, Social Network Analysis, Popularity

I Introduction

As the world becomes more interconnected and project teams are more commonly geographically dispersed, the role that social networks and social media play in successful completion of project tasks is quickly becoming accepted in many professional settings. One example of this is in software development where social networking services are used to facilitate collaborative development of software code across communities [1]. GitHub is one of the most commonly used services for asynchronous team-based software development, which provides a space for developers to store source code and interact with formal or informal collaborators to complete development projects. This platform is relatively unique compared to other social networks because it brings together professionals who work together to complete knowledge-based work, which provides an opportunity to investigate the diffusion of innovation using analytic approaches that leverage the abundance of data created by activity on GitHub.

Code on GitHub is stored in repositories, and the repository owner and collaborators make changes to the repository by committing their content. Three event types in particular are key for tracking public interest in a repository: forking, watching, and starring. A fork occurs when a user copies a repository and becomes the owner of the copy. Sometimes forks are created by the original team of collaborators to manage significant code changes, but anyone can fork a public repository. Developers can watch a repository to receive all notifications of changes, and star a repository to signal approval for the project and receive a compressed list of notifications. Forks are valuable for tracking the spread of innovation, and all three events (fork, watch, and star) have been used as measures of repository popularity.

In this paper, we demonstrate a repository popularity predictor that can forecast fork and watch demand for the subset of most active repositories by leveraging cross-repository events. For a given repository, these events can be treated as a sequence to model the volume of innovation diffusion. For example, Figure 1 shows watch events corresponding to two different popular repositories on GitHub over a three-year period. Our prediction approach relies on Recurrent Neural Networks (RNNs), which have been widely used in a variety of sequence learning problems including unsegmented handwriting generation [2] and natural language processing [3]. RNNs can process arbitrary-length input sequences and are particularly suited to cases where the elements of the sequence are not independent, i.e., where there is a hidden relationship among different sequence elements. Here, we employ one of the best performing sequence learning architectures, Long Short-Term Memory (LSTM) [4]. Our experiments show that an LSTM with cross-repository information outperforms an ARIMA (Auto-Regressive Integrated Moving Average) model that forecasts the future events for a repository using only its own past events. Our evaluation was conducted on a dataset composed of the public GitHub events and repository profiles from January 2015 through June 2017.

The remainder of this paper is organized as follows. Section II presents the related work on GitHub and popularity prediction in social media. Section III describes our dataset and our information encoding procedure. The LSTM architecture is introduced in Section IV. Results are provided in Section V and then we conclude the paper with a discussion of future work.

II Related Work

Although GitHub is relatively new, there have been many studies conducted on this social media platform. One locus of interest is understanding social behavior and teamwork in GitHub communities, using approaches such as regression modeling to investigate key drivers and behaviors in projects and teams [5, 6]. Ecosystems in social coding platforms emerge from commonalities in programming language and topic, along with code dependencies; it is possible to study their evolution over time using networks extracted from the GitHub event data [7, 8].

In addition, there has been research investigating the impact of utilizing social coding platforms on the software development process [9, 10, 11, 12]. These studies highlight the benefits and challenges of completing complex software development projects in this space. Much of the work in this area utilizes data-driven approaches that leverage the available data to investigate behaviors such as onboarding, pull requests, and documentation evolution [13, 14, 15]. While many of the studies on GitHub utilize data-driven techniques to investigate these phenomena, there are also several examples of survey and interview studies that aim to develop a more nuanced understanding of these events [16, 17, 18].

Prior work on GitHub popularity prediction demonstrated that the fraction of fork events a repository has received in the past is an effective heuristic for predicting the relative distribution of fork events across repositories in the future [19]. However, this popularity-based model of network evolution was only used to predict the general structure of the repo-user network rather than future event sequences.

There has also been research on modifying the recurrent neural network architecture to improve prediction performance. Wu et al. recently introduced a new network architecture, Deep Temporal Context Networks, for predicting social media popularity [20]. Rather than using a single time representation, DTCN uses multiple temporal contexts, combined with a temporal attention mechanism, to improve performance over a standard LSTM at ranking the popularity of photos on Flickr. Other types of prediction techniques, such as point process models, have been used to predict tweet popularity, measured by retweeting [21]. The key contribution of our paper is illustrating the value of cross-repository information, regardless of the prediction model employed.

III Data Description

Our GitHub activity dataset consists of 14 event types: CommitComment, Create, Delete, Fork, Gollum, IssueComment, Issue, Member, Public, Pullrequest, PullrequestReviewComment, Push, Release, and Watch. These events can be categorized into three groups: contributions, watches, and forks. This paper only examines watches and forks since they are the most relevant to repository popularity. The watch event occurs when a user stars a repository, and the fork event creates a copy of a repository that the user can modify without changing the original.

Our dataset includes the period from January 2015 to June 2017. First, we divide the time range into ten-day periods to be converted into sequences of watch or fork events. For our study, we selected the 100 repositories with the highest number of watch and fork events, based on their event profiles. The component event information from the profile is included as a feature. Comparing components also reveals whether there is an undirected path between repositories. Figure 2 shows our data sequence structure. $n_{t}$ is the number of either fork or watch events for each repository at time-step $t$. $c_{t}$ is the ID of the component to which each repository belongs. The input to the network at $t$ is $x_{t}=\{n_{t},c_{t}\}$ and the output $y_{t}=n_{t+1}$ is the prediction result.
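The encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' preprocessing code: the function names `bin_events` and `make_pairs`, the toy timestamps, and the component ID `7` are hypothetical, and the sketch assumes the prediction target at step $t$ is the event count of the following period.

```python
from datetime import datetime, timedelta


def bin_events(timestamps, start, end, step_days=10):
    """Count events in consecutive periods of step_days covering [start, end)."""
    step = timedelta(days=step_days)
    n_bins = -((start - end) // step)  # ceiling of (end - start) / step
    counts = [0] * n_bins
    for ts in timestamps:
        if start <= ts < end:
            counts[(ts - start) // step] += 1
    return counts


def make_pairs(counts, component_id):
    """Build (x_t, y_t) pairs with x_t = (n_t, c_t) and y_t = n_{t+1}."""
    return [((counts[t], component_id), counts[t + 1])
            for t in range(len(counts) - 1)]


# Toy example: six watch events over 30 days, i.e. three ten-day periods.
start, end = datetime(2015, 1, 1), datetime(2015, 1, 31)
watch_times = [datetime(2015, 1, d) for d in (2, 5, 12, 25, 26, 30)]
counts = bin_events(watch_times, start, end)   # [2, 1, 3]
pairs = make_pairs(counts, component_id=7)     # hypothetical component ID
```

Each pair couples the count and component ID of one period with the count of the next period, which is the quantity the network is asked to forecast.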

This sequential data is fed to the LSTM neural network in order to learn either the fork or watch patterns of each repository. After training, predictions can be made continuously by feeding in the new event counts in real time. For example, the number of watches or forks for time-step $t+1$ is forecast from the inputs at time-steps $[1,t-1]$ and the current number of forks or watches at time-step $t$.
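One common way to present such sequences to a recurrent network is a sliding window over the time-steps. The sketch below is an assumption about how this could be done, not a description of the authors' pipeline: `make_windows` is a hypothetical helper, each time-step holds one count per repository (so the cross-repository counts form the feature vector), and `look_back` stands in for the window length.

```python
def make_windows(series, look_back):
    """Turn a multivariate series into (window, next-step) training samples.

    series: list of time-steps, each a list with one event count per repository.
    Returns (inputs, targets); inputs[i] holds look_back consecutive time-steps
    and targets[i] is the time-step that immediately follows them.
    """
    inputs, targets = [], []
    for t in range(look_back, len(series)):
        inputs.append(series[t - look_back:t])
        targets.append(series[t])
    return inputs, targets


# Toy example: 5 time-steps, 2 repositories, window of 2 time-steps.
series = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]]
inputs, targets = make_windows(series, look_back=2)
# inputs[0] is [[1, 10], [2, 20]] and targets[0] is [3, 30]
```

Because every window contains the counts of all repositories, a model trained on these samples can use activity on one repository when forecasting another.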

IV Method

In GitHub, most users contribute code to multiple repositories and may copy code from many external repositories. Thus, it is likely that observing the event sequence of one repository provides information about users' activities on other repositories. Transfer entropy is a measure of influence in social media [22]; by testing for transfer entropy (a measure closely related to Granger causality) between event sequences, we observed that fusing information across multiple repository event sequences could be helpful. To perform this fusion, we needed a model that performs well with multidimensional time series data in order to simultaneously consider these joint trends; these considerations guided our choice toward recurrent neural networks (RNNs). In the context of time series forecasting, RNNs capture information from past inputs and use it alongside the current input to predict future time-steps. Although RNNs can in theory store a long sequence of past information, in practice their memory is limited. Figure 3 illustrates the structure of a general RNN architecture.

The variables in Figure 3 are as follows:

- $x_{t}$ represents the input data at time-step $t$.
- $h_{t}$ represents the hidden state at time-step $t$, which depends on the previous hidden state as well as the current input.
- $y_{t}$ represents the output at time-step $t$.
- $W_{xh}$, $W_{hh}$ and $W_{hy}$ represent weights shared across all unrolled time-steps.

The central distinguishing feature of RNNs lies in the hidden layer structure. These layers are in charge of capturing and using the past information from all previous time-steps. The computations are the same at each time-step; however, they are applied to different inputs $x_{t}$, so the outputs differ as well. Sharing the computation across time-steps reduces the total number of parameters, which also helps limit over-fitting. The error is computed based on the difference between the actual value extracted from the dataset and the concatenation of outputs from all layers. For this paper, we use the LSTM architecture [4], which is very versatile and has been shown to perform well on a wide variety of problems [2, 3], including prediction of trends in social media [20]. Our LSTM network model is implemented on top of Keras.
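To make the weight sharing concrete, here is a minimal forward pass of a plain (non-LSTM) RNN using the symbols $x_{t}$, $h_{t}$, $y_{t}$, $W_{xh}$, $W_{hh}$ and $W_{hy}$ from the list above. It is a sketch under stated assumptions: the tanh activation, the absence of bias terms, and the helper names `matvec` and `rnn_forward` are illustrative choices, and an LSTM cell adds gating on top of this recurrence.

```python
import math


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_forward(xs, W_xh, W_hh, W_hy, h0):
    """Run a plain RNN over the input sequence xs.

    The same three weight matrices are reused at every time-step:
        h_t = tanh(W_xh x_t + W_hh h_{t-1})   (assumed activation, no bias)
        y_t = W_hy h_t
    Returns the list of outputs and the final hidden state.
    """
    h = h0
    ys = []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(W_xh, x), matvec(W_hh, h))]
        h = [math.tanh(p) for p in pre]
        ys.append(matvec(W_hy, h))
    return ys, h


# Toy example with a one-dimensional input, hidden state, and output.
ys, h_last = rnn_forward(
    xs=[[1.0], [0.0]],
    W_xh=[[1.0]], W_hh=[[1.0]], W_hy=[[2.0]],
    h0=[0.0],
)
```

In the toy run the second input is zero, yet the second output is non-zero because the hidden state carries information forward from the first time-step.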

V Results

This section presents an evaluation of our model’s ability to forecast watch and fork event time series on our dataset, since these events are the best direct measure of repository popularity. We compare our model to ARIMA (Auto-Regressive Integrated Moving Average). Our dataset contains events from January 2015 through March 2017 for both event types; from this, we sample the 100 repositories with the highest fork and watch counts during this period. The data from these repositories was divided into 10-day intervals. 80% of the training data was used for model training and the remaining 20% was reserved for validation. We stop the training when the validation error does not change for 100 epochs. Figure 4 shows how the LSTM loop back size was calculated. Table I summarizes the dataset details and the hyperparameters used in our experiments.
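The split and stopping rule can be illustrated with plain Python. This is a hedged sketch rather than the training code used in the paper: `split_train_validation` and `stopping_epoch` are hypothetical helpers, the split is assumed to keep the samples in their original order, and "does not change" is interpreted as "does not improve on the best validation error seen so far".

```python
def split_train_validation(samples, train_fraction=0.8):
    """Keep the first train_fraction of the samples for training (assumed order-preserving)."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]


def stopping_epoch(val_errors, patience=100, min_delta=0.0):
    """Return the epoch index at which training stops, or None if it never does.

    Training stops once the validation error has failed to improve on the
    best value seen so far for `patience` consecutive epochs.
    """
    best = float("inf")
    stale = 0
    for epoch, err in enumerate(val_errors):
        if best - err > min_delta:
            best = err
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


train, val = split_train_validation(list(range(10)))
# Small patience so the toy error curve triggers the rule.
stop = stopping_epoch([5.0, 4.0, 3.0, 3.0, 3.0, 3.0], patience=3)
```

With `patience=100`, as in the text, the same function would stop only after 100 consecutive epochs without improvement.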

V-A Benchmarks

Our benchmark is a standard ARIMA (Auto Regressive Integrated Moving Average) time series predictor that uses the past time series to forecast the future. To evaluate our method, the LSTM (with cross-repository information) is compared to the standard ARIMA model implemented with the Pyramid library, a statistical Python library that brings R’s auto.arima functionality to Python.

To evaluate our prediction performance, we compared the results of Root Mean Square Error (RMSE) over time and repositories. $Y_{r,t}$ is the actual number of fork or watch events for repository $r$ at time-step $t$, and $\widehat{Y}_{r,t}$ is the predicted number of that type of event. The RMSE for repository $r$ over time $[1,T]$ is:

$$RMSE_{r}=\sqrt{\frac{1}{T}\sum_{t=1}^{T}\big(Y_{r,t}-\widehat{Y}_{r,t}\big)^{2}}$$

To evaluate prediction performance over all repositories, we calculated the RMSE of all repositories at time-step $t$ as follows:

$$RMSE_{t}=\sqrt{\frac{1}{R}\sum_{r=1}^{R}\big(Y_{r,t}-\widehat{Y}_{r,t}\big)^{2}}$$

Here $R$ is the total number of repositories in the dataset. To evaluate the prediction performance over all repositories, we compare the performance of the LSTM predictor and ARIMA in terms of $RMSE_{r}$ and $RMSE_{t}$ as well as the total average RMSE as a single value. Table II shows that our proposed method, LSTM with cross-repository information, outperforms the ARIMA time series prediction. The next section analyzes this result in more detail.

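The total average RMSE over all repositories and time-steps is:

$$\text{Total average RMSE}=\sqrt{\frac{1}{R\cdot T}\sum_{t=1}^{T}\sum_{r=1}^{R}\big(Y_{r,t}-\widehat{Y}_{r,t}\big)^{2}}$$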
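The three error measures used in this section (per repository, per time-step, and the total average) can be computed as follows. The function names and the toy numbers are illustrative only; `Y[r][t]` and `Y_hat[r][t]` play the roles of $Y_{r,t}$ and $\widehat{Y}_{r,t}$.

```python
import math


def rmse_per_repository(Y, Y_hat):
    """RMSE_r: error of each repository r, averaged over its T time-steps."""
    return [
        math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, pred)) / len(actual))
        for actual, pred in zip(Y, Y_hat)
    ]


def rmse_per_time_step(Y, Y_hat):
    """RMSE_t: error at each time-step t, averaged over the R repositories."""
    R, T = len(Y), len(Y[0])
    return [
        math.sqrt(sum((Y[r][t] - Y_hat[r][t]) ** 2 for r in range(R)) / R)
        for t in range(T)
    ]


def total_rmse(Y, Y_hat):
    """Total average RMSE over all R repositories and T time-steps."""
    R, T = len(Y), len(Y[0])
    squared = sum((Y[r][t] - Y_hat[r][t]) ** 2
                  for r in range(R) for t in range(T))
    return math.sqrt(squared / (R * T))


# Toy example: 2 repositories, 2 time-steps.
Y = [[3, 5], [10, 10]]
Y_hat = [[0, 1], [10, 10]]
per_repo = rmse_per_repository(Y, Y_hat)   # [sqrt(12.5), 0.0]
per_step = rmse_per_time_step(Y, Y_hat)    # [sqrt(4.5), sqrt(8.0)]
overall = total_rmse(Y, Y_hat)             # 2.5
```

Note that the total is the root of the pooled mean squared error, so it is not in general the arithmetic mean of the per-repository or per-time-step values.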

V-B Analysis

Figures 5 and 7 present a breakdown of the prediction of watch and fork events according to the $RMSE_{t}$ metric, in which the performance of all repositories is averaged together. LSTM consistently exhibits a lower error over all time-steps. The performance breakdown per repository for the $RMSE_{r}$ metric is less clear (Figures 6 and 8). The LSTM appears to be better at predicting the global event changes across all repositories over time; ARIMA is unable to capture this since it lacks the cross-repository features. Figure 9 shows the specific predictions made by ARIMA and LSTM for each of the repositories; LSTM tends to predict more activity for the repositories, whereas ARIMA predicts less.

VI Conclusion

This paper introduces an approach for leveraging cross-repository activity to forecast trends in GitHub repository popularity. We show that an LSTM sequence learning model with cross-repository features outperforms a standard ARIMA time series forecast at predicting the volume of fork and watch events over time. The LSTM is clearly better at capturing general shifts in user activity across GitHub. In future work, we plan to combine the prediction of multiple event types into a single learned model to arrive at a more cohesive measure of repository popularity. We also believe that including past information about specific transfer entropy between repositories could further improve the prediction performance.

Acknowledgments

This research was supported by DARPA program HR001117S0018.

Bibliography

[1] A. Begel, R. DeLine, and T. Zimmermann, “Social media for software engineering,” in Proceedings of the FSE/SDP Workshop on Future of Software Engineering Research. ACM, 2010, pp. 33–38.
[2] A. Graves, “Generating sequences with recurrent neural networks,” arXiv preprint arXiv:1308.0850, 2013.
[3] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Proceedings of NIPS, December 2014, pp. 3104–3112.
[4] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[5] B. Vasilescu, D. Posnett, B. Ray, M. G. van den Brand, A. Serebrenik, P. Devanbu, and V. Filkov, “Gender and tenure diversity in GitHub teams,” in Proceedings of the Annual Conference on Human Factors in Computing Systems. ACM, 2015, pp. 3789–3798.
[6] N. Hajiakhoond Bidoki and G. Sukthankar, “A communication protocol for man-machine networks,” in arXiv:1808.07975 [cs.MA], 2018, pp. 1513–1522.
[7] K. Blincoe, F. Harrison, and D. Damian, “Ecosystems in GitHub and a method for ecosystem identification using reference coupling,” in Proceedings of the Working Conference on Mining Software Repositories. IEEE Press, 2015, pp. 202–207.
[8] L. Singer, F. Figueira Filho, B. Cleary, C. Treude, M.-A. Storey, and K. Schneider, “Mutual assessment in the social programmer ecosystem: an empirical investigation of developer profile aggregators,” in Proceedings of the ACM Conference on Computer Supported Cooperative Work, 2013, pp. 103–116.