Predicting future stock market structure by combining social and financial network information
Thársis T. P. Souza, Tomaso Aste

TL;DR
This paper presents a multiplex network approach combining social media and financial data to predict future stock market correlation structures with significantly improved accuracy, especially over long-term horizons.
Contribution
It introduces a novel model that integrates social and financial network information using link persistence and triadic closure, enhancing prediction of market structure.
Findings
Up to 40% out-of-sample performance improvement over benchmark models.
Social media data improves long-term market structure predictions.
Financial market structure is more predictable than social opinion structure.
| Lag | (%) | |||||
|---|---|---|---|---|---|---|
| 87 (0.33) | 21 (0.76) | 93 (0.11) | 34 (1.2) | 97 (0.064) | 4 (0.091) | |
| 87 (0.37) | 33 (1.2) | 93 (0.1) | 45 (1.5) | 95 (0.092) | 6 (0.14) | |
| 86 (0.39) | 48 (1.5) | 93 (0.11) | 60 (1.6) | 94 (0.11) | 8 (0.17) | |
| 86 (0.39) | 65 (2) | 93 (0.11) | 65 (1.9) | 93 (0.13) | 10 (0.21) | |
| 85 (0.41) | 85 (2.6) | 93 (0.11) | 66 (1.9) | 92 (0.15) | 11 (0.24) | |
| 85 (0.41) | 100 (3.2) | 93 (0.1) | 74 (2) | 91 (0.16) | 12 (0.27) | |
| 84 (0.42) | 120 (3.5) | 93 (0.1) | 70 (2.2) | 90 (0.18) | 13 (0.3) | |
| 84 (0.43) | 150 (4.3) | 93 (0.1) | 72 (1.9) | 89 (0.19) | 15 (0.33) | |
| 83 (0.44) | 180 (5.7) | 93 (0.1) | 74 (2.2) | 88 (0.21) | 16 (0.37) | |
| 83 (0.43) | 220 (6.3) | 93 (0.096) | 79 (1.9) | 87 (0.21) | 17 (0.4) | |
| 82 (0.43) | 260 (7.2) | 93 (0.094) | 78 (2) | 87 (0.22) | 18 (0.43) | |
| 82 (0.42) | 300 (7.9) | 93 (0.09) | 86 (2.4) | 86 (0.22) | 19 (0.45) | |
| 82 (0.43) | 330 (7.9) | 93 (0.09) | 95 (2.1) | 85 (0.22) | 20 (0.49) | |
| 81 (0.43) | 360 (9.2) | 93 (0.084) | 100 (2.4) | 84 (0.23) | 21 (0.51) | |
| 81 (0.43) | 390 (9.9) | 93 (0.083) | 110 (2.3) | 84 (0.24) | 22 (0.55) | |
| 81 (0.43) | 410 (10) | 93 (0.08) | 120 (3) | 83 (0.24) | 23 (0.58) | |
| 80 (0.43) | 440 (11) | 94 (0.079) | 130 (2.6) | 82 (0.25) | 24 (0.62) | |
| 80 (0.44) | 470 (12) | 94 (0.076) | 150 (3) | 82 (0.25) | 25 (0.67) | |
| 80 (0.46) | 500 (12) | 94 (0.072) | 160 (3.6) | 81 (0.27) | 26 (0.71) | |
| 80 (0.48) | 510 (12) | 94 (0.068) | 170 (3.7) | 80 (0.28) | 27 (0.79) | |
| *A likelihood ratio of indicates statistical significance at . | ||||||
Department of Computer Science, UCL, Gower Street, London, WC1E 6BT, UK
Abstract
We demonstrate that future market correlation structure can be predicted with high out-of-sample accuracy using a multiplex network approach that combines information from social media and financial data. Market structure is measured by quantifying the co-movement of asset prices returns, while social structure is measured as the co-movement of social media opinion on those same assets. Predictions are obtained with a simple model that uses link persistence and link formation by triadic closure across both financial and social media layers. Results demonstrate that the proposed model can predict future market structure with up to a 40% out-of-sample performance improvement compared to a benchmark model that assumes a time-invariant financial correlation structure. Social media information leads to improved models for all settings tested, particularly in the long-term prediction of financial market structure. Surprisingly, financial market structure exhibited higher predictability than social opinion structure.
keywords:
Financial Networks; Network Link Prediction; Correlation Structure Prediction; Information Filtering Networks; Correlation-Based Networks; Social Media
1 Introduction
Financial markets can be regarded as a complex network in which nodes represent different financial assets and edges represent one or many types of relationships among those assets. Filtered correlation-based networks have successfully been used in the literature to study financial markets structure particularly from observational data derived from empirical financial time series [1, 2, 3, 4, 5]. The underlying principle is to use correlations from empirical financial time series to construct a sparse network representing the most relevant connections. Analyses on filtered correlation-based networks for information extraction [6, 7, 3] have widely been used to explain market interconnectedness from high-dimensional data. Applications include asset allocation [8], market stability assessments [9], hierarchical structure analyses [2, 3, 4, 10, 11] and the identification of lead-lag relationships [12].
The majority of the literature so far has focused on the analysis of financial time series. However, in recent years a large amount of information about financial markets has become available from exogenous sources such as social media. It is reasonable to conceive that changes in social media sentiment [13] and changes in asset prices might be related. Some previous studies have indeed demonstrated the existence of such relationships, which in some cases indicated that social media can be used to predict changes in asset prices [14, 15, 16, 17, 18, 19]. When new information hits the markets, investors may react either rationally or irrationally [20, 21]. They may express opinions on social media that can later become market actions, thus enabling opportunities to forecast future asset prices. However, it has also been highlighted that not all assets behave in the same way. Some are more influenced by social media sentiment, while others, on the contrary, are more influential on social media sentiment [22]. Beyond individual financial assets, we address in this study whether the entire stock market structure is related to the structure constructed from social media sentiment and whether lead-lag relationships exist that can be used for forecasting one structure in terms of the other.
We use dynamical Kendall correlations computed over rolling windows to investigate the temporal evolution of market structure represented by filtered correlation-based networks constructed from stock market prices and from Twitter sentiment signals. We generate two networks: one from log-returns of stock prices and the other from Twitter sentiment. The two networks are treated as a multilayer problem with two layers of networks that share the same nodes but have different edge sets. We investigate whether financial market structure can be better predicted by combining past financial information with past social media sentiment information. The market structure forecasting problem is formulated as a link prediction problem where we estimate the probability of addition or removal of a link in the future based on information about the structure of the financial and social networks in the past.
2 Methods
2.1 Financial and Social Networks
We selected $N = 100$ of the most capitalized companies that were part of the S&P500 index from 09/05/2012 to 08/25/2017. The list of these companies’ ticker symbols is reported in Appendix A.1. For each stock $i$, the financial variable was defined as the stock’s daily log-return $R_i(u) = \log P_i(u) - \log P_i(u-1)$, where $P_i(u)$ designates the closing price at time $u$. The social media variable was defined as the social media opinion $O_i(u)$ of stock $i$, which was estimated as the total number of bullish daily tweets related to the stock at time $u$. Twitter sentiment data were provided by PsychSignal.com [23]. In this dataset, a Twitter message was defined to be related to a given stock when its ticker symbol was mentioned. The dataset used only English language content and it was agnostic to the country source of the Twitter message. We have provided further descriptive analytics of the Twitter sentiment dataset used in related literature [14, 22].
Stock returns $R_i$ and social media opinion scores $O_i$ each amounted to a time series of 1251 trading days. These series were divided time-wise into windows $t = 1, 2, \ldots$ of width $\theta = 126$ trading days. A window step length parameter $\delta$, measured in trading days, defined the displacement of the window, i.e., the number of trading days between two consecutive windows. The choice of window width and window step is arbitrary, and it is a trade-off between an analysis that is either too dynamic or too smooth. The smaller the window width and the larger the window steps, the more dynamic the data are.
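As a minimal sketch of this preprocessing step, the Python snippet below computes daily log-returns from a price series and slices the result into overlapping windows. The function names `log_returns` and `rolling_windows`, the toy prices, and the small width and step values are illustrative assumptions, not the authors' code or settings.

```python
import math


def log_returns(prices):
    """Daily log-returns: log P(u) - log P(u - 1) for consecutive closing prices."""
    return [math.log(b) - math.log(a) for a, b in zip(prices, prices[1:])]


def rolling_windows(series, width, step):
    """Yield windows of `width` observations, each displaced by `step` observations."""
    for start in range(0, len(series) - width + 1, step):
        yield series[start:start + width]


# Toy closing prices (hypothetical values, for illustration only).
prices = [100.0, 101.0, 99.0, 102.0, 103.0, 101.5, 104.0]
returns = log_returns(prices)
windows = list(rolling_windows(returns, width=4, step=1))
```

With a real dataset, `width` and `step` would be set to the window width and window step discussed above.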
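To characterize the synchronous time evolution of assets, we used equal-time Kendall’s rank coefficients between assets $i$ and $j$, defined as

$$\tau_{i,j}(t) = \frac{2}{\theta(\theta - 1)} \sum_{u < v} s_i(u, v)\, s_j(u, v), \qquad (1)$$

where $u$ and $v$ are time indexes within the window $t$ of width $\theta$, and $s_k(u, v) \equiv \operatorname{sgn}\left(X_k(u) - X_k(v)\right)$, with $X_k$ denoting the variable of asset $k$ under study (log-return or social media opinion).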
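The rank coefficient can be illustrated with a short pure-Python sketch. It implements the textbook tau-a form (the sum of sign products over all pairs of time indexes, divided by the number of pairs), which is an assumption about the exact variant used here; `kendall_tau` and `sign` are hypothetical helper names.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def kendall_tau(x, y):
    """Kendall's tau-a between two equally long series observed in one window."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have the same length, at least 2")
    total = 0
    for u in range(n - 1):
        for v in range(u + 1, n):
            # +1 for a concordant pair, -1 for a discordant pair, 0 for a tie.
            total += sign(x[u] - x[v]) * sign(y[u] - y[v])
    return total / (n * (n - 1) / 2)


tau_same = kendall_tau([1, 2, 3, 4], [1, 2, 3, 4])
tau_reversed = kendall_tau([1, 2, 3, 4], [4, 3, 2, 1])
tau_mixed = kendall_tau([1, 2, 3], [1, 3, 2])
```

Tweet counts can contain many ties, for which tie-corrected variants exist (SciPy's `scipy.stats.kendalltau` defaults to tau-b); this sketch does not apply such a correction.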
Kendall’s rank coefficients $\tau_{i,j}(t)$ fulfill the condition $-1 \le \tau_{i,j}(t) \le 1$ and form the $N \times N$ correlation matrix $C(t)$ that served as the basis for the networks constructed in this paper, $N$ being the number of stocks. To construct the asset-based financial and social networks, we defined a distance $d_{i,j}(t)$ between a pair of stocks. This distance was associated with the edge connecting the stocks, and it reflected the level at which they were correlated. We used a simple non-linear transformation $d_{i,j}(t) = \sqrt{2\,(1 - \tau_{i,j}(t))}$ to obtain distances with the property $0 \le d_{i,j}(t) \le 2$, forming an $N \times N$ symmetric distance matrix $D(t)$.
We extracted the $N(N-1)/2$ distinct distance elements from the upper triangular part of the distance matrix $D(t)$, which were then sorted in ascending order to form an ordered sequence $d_1(t) \le d_2(t) \le \ldots \le d_{N(N-1)/2}(t)$. Since we require the graph to be representative of the market, it is natural to build the network by including only the strongest connections. This is a network filtering procedure that has been successfully applied in the construction of asset graphs for the analyses of market structure [24, 25]. The number of edges to include is arbitrary, and we included those from the bottom quartile, which represented the 25% shortest edges in the graph (largest correlations), thus giving approximately $N(N-1)/8$ edges per network.
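The filtering step can be sketched as follows. The snippet assumes the commonly used transformation $d = \sqrt{2(1 - \tau)}$ from correlation to distance and keeps a given fraction of the shortest distances; the name `filtered_edges` and the toy coefficients are hypothetical.

```python
import math


def filtered_edges(tau, fraction=0.25):
    """Keep the `fraction` of node pairs with the shortest distances.

    `tau` maps a pair (i, j) with i < j to its rank correlation coefficient.
    The distance sqrt(2 * (1 - tau)) is an assumed, commonly used choice.
    """
    dist = {pair: math.sqrt(2.0 * (1.0 - c)) for pair, c in tau.items()}
    ranked = sorted(dist, key=dist.get)      # ascending distance
    keep = int(len(ranked) * fraction)       # number of edges retained
    return set(ranked[:keep])


# Toy coefficients for 4 assets (6 distinct pairs), illustration only.
tau = {(0, 1): 0.9, (0, 2): 0.1, (0, 3): -0.5,
       (1, 2): 0.8, (1, 3): 0.0, (2, 3): 0.7}
half = filtered_edges(tau, fraction=0.5)
quarter = filtered_edges(tau, fraction=0.25)
```

Because the transformation is monotonically decreasing in the correlation, keeping the shortest distances is the same as keeping the largest correlations.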
We denoted $E^F(t)$ and $E^S(t)$ as the sets of edges constructed from the distance matrices derived from stock returns $R$ and social media opinion $O$, respectively. The two networks were considered as two layers of a duplex structure $G(t) = \{G^F(t), G^S(t)\}$, where $G^F(t) = (V, E^F(t))$, $G^S(t) = (V, E^S(t))$, and $V$ is the vertex set of stocks, which is common to both layers.
2.2 Persistence
The state of an edge between vertices $i$ and $j$ in the financial layer at time $t$ was represented with the corresponding adjacency matrix element $a^F_{i,j}(t)$: a binary variable with $a^F_{i,j}(t) = 1$ indicating the existence of the edge and $a^F_{i,j}(t) = 0$ its absence. Analogously, the variable $a^S_{i,j}(t)$ accounted for the presence or absence of edge $(i, j)$ in the social ($S$) layer. The variable $a^M_{i,j}(t) = 1$ indicates instead the presence of at least one edge between $i$ and $j$ in the two layers; $a^M_{i,j}(t) = 0$ indicates that no edges are present between $i$ and $j$ in any layer.
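A small sketch of these edge-state variables, assuming each layer is stored as a set of undirected edges; the name `edge_state` and the toy layers are hypothetical.

```python
def edge_state(edge_set, i, j):
    """Return 1 if the undirected edge between i and j is in the layer, else 0."""
    return int((min(i, j), max(i, j)) in edge_set)


financial = {(0, 1), (1, 2)}       # toy financial layer
social = {(1, 2), (2, 3)}          # toy social layer
multiplex = financial | social     # an edge exists if present in at least one layer

state_fin = edge_state(financial, 1, 0)
state_soc = edge_state(social, 0, 1)
state_multi = edge_state(multiplex, 0, 1)
```

The union of the two edge sets reproduces the "at least one edge in the two layers" definition of the combined variable.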
2.3 Triadic Closure
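Let $\Gamma^{\ell}_{i,j}(t)$ be the set of nodes that are common neighbors to vertices $i$ and $j$ in layer $\ell$ at time $t$. We defined the triadic closure of an edge $(i, j)$ at layer $\ell$ and time $t$ as the mean of the clustering coefficients of vertices in $\Gamma^{\ell}_{i,j}(t)$:

$$T^{\ell}_{i,j}(t) = \frac{1}{|\Gamma^{\ell}_{i,j}(t)|} \sum_{k \in \Gamma^{\ell}_{i,j}(t)} c^{\ell}_k(t), \qquad (2)$$

where the term $c^{\ell}_k(t)$ is the clustering coefficient of node $k$, which accounts for the fraction of triads in the neighbors of $k$ that are closed in triangles. This is defined as

$$c^{\ell}_k(t) = \frac{2\,\left|\{(m, n) : m, n \in \mathcal{N}_k,\ (m, n) \in E^{\ell}(t)\}\right|}{\kappa_k(\kappa_k - 1)}, \qquad (3)$$

where $\kappa_k$ is the degree of vertex $k$, $\mathcal{N}_k$ is the neighborhood of $k$, and $E^{\ell}(t)$ is the edge set of layer $\ell$ at time $t$.
In the multiplex case, we kept the same definition but allowed triangles to form across several layers [26, 27]. For the multiplex case, we used the symbol $T^M_{i,j}(t)$.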
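The triadic closure score can be sketched in a few lines of Python. The helper names are hypothetical, and two conventions are assumptions made only for this example: a node with fewer than two neighbors gets clustering 0, and a pair with no common neighbors gets a score of 0. Passing the union of the two layers' edge sets is one simple reading of the multiplex case, not necessarily the construction of [26, 27].

```python
from itertools import combinations


def neighbors(edges):
    """Build a node -> set-of-neighbors map from undirected edges."""
    nbrs = {}
    for i, j in edges:
        nbrs.setdefault(i, set()).add(j)
        nbrs.setdefault(j, set()).add(i)
    return nbrs


def clustering(nbrs, v):
    """Fraction of pairs of neighbors of v that are themselves connected."""
    k = len(nbrs.get(v, ()))
    if k < 2:
        return 0.0  # assumed convention for low-degree nodes
    closed = sum(1 for m, n in combinations(sorted(nbrs[v]), 2) if n in nbrs[m])
    return 2.0 * closed / (k * (k - 1))


def triadic_closure(edges, i, j):
    """Mean clustering coefficient over the common neighbors of i and j."""
    nbrs = neighbors(edges)
    common = nbrs.get(i, set()) & nbrs.get(j, set())
    if not common:
        return 0.0  # assumed convention when there are no common neighbors
    return sum(clustering(nbrs, v) for v in common) / len(common)


edges = {(0, 1), (0, 2), (1, 2), (2, 3)}   # a triangle 0-1-2 plus a pendant node 3
score_03 = triadic_closure(edges, 0, 3)    # common neighbor: node 2
score_02 = triadic_closure(edges, 0, 2)    # common neighbor: node 1
```

In the toy graph, node 2 has three neighbors with one closed pair, so its clustering coefficient is 1/3, while node 1 sits in a fully closed triangle.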
2.4 Link Prediction
We aim to predict the probability that an edge $(i, j)$ is inserted or removed in the financial network, $G^F$, at a future time $t + h$ by using the information about the past structures of the financial and social networks at previous times up to $t$. For this purpose we considered two mechanisms:
- 1) the tendency of an edge present at a previous time to persist in the future (edge persistence);
- 2) the propensity of triangles within or across layers to close (triadic closure).
The mechanism of growth by triadic closure is based on a principle of transitivity, often observed in real-world networks, where there is a tendency to form triangles. Under this principle, two nodes tend to be connected if they share common neighbors with high transitivity, i.e., propensity to close triangles.
The probability that an edge will be inserted in the future is computed by means of a logistic regression of the edge persistence and the triadic closure coefficients. We estimated regression coefficients by best fitting on a training set which was composed of rolling windows of 126 trading days that initially ranged from 09/05/2012 to 09/10/2014. Predictions concerning the presence of edges in the financial network were made at $h = 1$ to $h = 20$ weeks ahead of the end of the training set. The test set initially ranged from 09/17/2014 to 08/25/2017. The procedure was repeated by moving the training window forward in 1-week steps.
The probability $p_{i,j}(t + h)$ of observing vertices $i$, $j$ connected by an edge at time $t + h$ can be inferred in terms of the previous triadic closure coefficients, $T^F_{i,j}(t)$, and edge persistence scores, $a^F_{i,j}(t)$. We first considered a restricted model that used financial information only, which is given by the following:

$$\log \frac{p_{i,j}(t + h)}{1 - p_{i,j}(t + h)} = \beta_0 + \beta^F_1\, a^F_{i,j}(t) + \beta^F_2\, T^F_{i,j}(t). \qquad (4)$$

For this restricted model, we performed a 1-step ahead prediction for each lag of $h$ weeks.
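To calibrate the parameters in Eq. 4, we considered a training window $\mathcal{W}_t$ of $W$ days which ends at time $t$. The log-likelihood function [28] over the training window for the logistic model from Eq. 4 is given by

$$\mathcal{L}(\beta) = \sum_{s \in \mathcal{W}_t} \sum_{i < j} \left[ a^F_{i,j}(s + h) \log p_{i,j}(s + h) + \left(1 - a^F_{i,j}(s + h)\right) \log\left(1 - p_{i,j}(s + h)\right) \right], \qquad (5)$$

where $p_{i,j}(s + h)$ is the probability that Eq. 4 assigns to edge $(i, j)$ being present at time $s + h$, and $a^F_{i,j}(s + h) \in \{0, 1\}$ is the observed state of that edge in the financial layer.
We differentiated the log-likelihood function given by Eq. 5 in order to find maximum likelihood estimates for the coefficients of Eq. 4.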
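A minimal sketch of this calibration step is given below, assuming a plain logistic regression with two predictors per edge (a persistence indicator and a triadic closure score) fitted by gradient ascent on the Bernoulli log-likelihood. The function names, learning rate, number of steps, and toy data are illustrative assumptions and do not reproduce the authors' fitting routine.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(beta, row):
    """Edge probability under the logistic model; beta[0] is the intercept."""
    return sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], row)))


def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood of the observed edge states y given predictors X."""
    return sum(t * math.log(predict(beta, r)) + (1 - t) * math.log(1.0 - predict(beta, r))
               for r, t in zip(X, y))


def fit_logistic(X, y, lr=0.1, steps=2000):
    """Gradient ascent on the mean log-likelihood (hypothetical settings)."""
    beta = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for row, target in zip(X, y):
            err = target - predict(beta, row)   # derivative w.r.t. the linear term
            grad[0] += err
            for k, x in enumerate(row, start=1):
                grad[k] += err * x
        beta = [b + lr * g / len(X) for b, g in zip(beta, grad)]
    return beta


# Toy data: (persistence at t, triadic closure at t) -> edge present at t + h.
X = [(1, 0.8), (1, 0.5), (1, 0.2), (1, 0.6), (0, 0.6), (0, 0.3), (0, 0.0), (0, 0.4)]
y = [1, 1, 0, 0, 0, 0, 0, 1]
beta_hat = fit_logistic(X, y)
ll_start = log_likelihood([0.0, 0.0, 0.0], X, y)
ll_fit = log_likelihood(beta_hat, X, y)
```

Setting the gradient to zero, as the text describes, has no closed-form solution for logistic regression, which is why an iterative scheme is used in this sketch.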
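To verify whether the multiplex information is relevant in the prediction of links in the financial network compared to the past financial network alone, we considered a full regression model that takes the previous triadic closure coefficients and edge persistence from the financial layer ($F$), social layer ($S$) and the multiplex network ($M$). The full model is

$$\log \frac{p_{i,j}(t + h)}{1 - p_{i,j}(t + h)} = \beta_0 + \sum_{\ell \in \{F, S, M\}} \left[ \beta^{\ell}_1\, a^{\ell}_{i,j}(t) + \beta^{\ell}_2\, T^{\ell}_{i,j}(t) \right], \qquad (6)$$

where $p_{i,j}(t + h)$ is the probability that edge $(i, j)$ is present in the financial layer at time $t + h$, and $a^{\ell}_{i,j}(t)$ and $T^{\ell}_{i,j}(t)$ are the edge persistence and triadic closure terms of layer $\ell$.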
The log-likelihood function of the full model in Eq. 6 and the model fitting can be obtained in an analogous manner to the previously performed procedure for the restricted model from Eq. 4.
The likelihood ratio statistic is

$$\Lambda = 2\left(\mathcal{L}_{\mathrm{full}} - \mathcal{L}_{\mathrm{restricted}}\right), \qquad (7)$$

where $\mathcal{L}_{\mathrm{full}}$ and $\mathcal{L}_{\mathrm{restricted}}$ are, respectively, the maxima of the log-likelihood functions of the full and restricted models in the training set window. The likelihood ratio statistic can be assumed to follow a $\chi^2$ distribution [28] with 4 degrees of freedom, where a value of $\Lambda$ above the critical value for the chosen significance level is assumed to be statistically significant. In that case, there is evidence to prefer the full model that considers social and financial information over the restricted model that considers financial information only.
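The test can be sketched with the standard library alone, because the $\chi^2$ survival function has a closed form for 4 degrees of freedom: $P(X > x) = e^{-x/2}(1 + x/2)$. The function names and the example log-likelihood values below are hypothetical.

```python
import math


def likelihood_ratio(ll_full, ll_restricted):
    """Twice the gap between the maximized log-likelihoods of the nested models."""
    return 2.0 * (ll_full - ll_restricted)


def chi2_sf_4dof(x):
    """Survival function P(X > x) of a chi-square variable with 4 degrees of freedom."""
    if x <= 0:
        return 1.0
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)


# Hypothetical maximized log-likelihoods of the full and restricted models.
ratio = likelihood_ratio(ll_full=-100.0, ll_restricted=-110.0)
p_value = chi2_sf_4dof(ratio)
```

For other degrees of freedom the closed form changes, so a general implementation would rely on a statistics library instead.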
The model performance was estimated by counting both the true positives (edges predicted to be there and indeed present in the future network) and the false positives (edges predicted to be there but not present in the future network) and measuring the AUC (area under the receiver operating characteristic curve) in the test set that originally ranged from 09/17/2014 to 08/25/2017. An AUC of 0.50 corresponds to random guessing and 1.00 to perfect discrimination, with higher values indicating that the model discriminates better between the two categories of edge-present and edge-absent.
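As an illustration of the metric, the AUC equals the probability that a randomly chosen present edge receives a higher score than a randomly chosen absent edge, with ties counted as one half. The sketch below computes it by direct pair counting; the function name `auc` and the toy scores are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
partial = auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

The pairwise form is quadratic in the number of edges; rank-based implementations give the same value more efficiently on large networks.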
3 Results
3.1 Market structure dynamics
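We first investigated financial network persistence by comparing the financial network at time $t$ with a future financial network $h$ steps ahead. To quantify the changes in the correlation network structure, we used two measures: A) the fraction of new edges in $E^F(t + h)$ that were not present in $E^F(t)$; B) the Jaccard Distance, defined as

$$J\left(E^F(t), E^F(t + h)\right) = 1 - \frac{\left|E^F(t) \cap E^F(t + h)\right|}{\left|E^F(t) \cup E^F(t + h)\right|}, \qquad (8)$$

where $E^F(t)$ denotes the edge set of the financial network at time $t$.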
Results are reported in Fig. 2, panels A) and B), respectively.
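Both measures reduce to simple set operations on edge sets, as the following sketch shows. The function names and toy edge sets are hypothetical, and returning 0 for empty inputs is an assumed convention.

```python
def jaccard_distance(a, b):
    """One minus the ratio of shared edges to edges present in either network."""
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty networks
    return 1.0 - len(a & b) / len(union)


def fraction_new_edges(current, future):
    """Share of edges in the future network that are absent from the current one."""
    if not future:
        return 0.0  # assumed convention for an empty future network
    return len(future - current) / len(future)


now = {(0, 1), (1, 2), (2, 3)}
later = {(0, 1), (1, 2), (3, 4)}
distance = jaccard_distance(now, later)
new_share = fraction_new_edges(now, later)
```

In the toy example, two of four distinct edges are shared, and one of the three future edges is new.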
Fig. 2 panel A) shows the mean percentage of new edges in the financial network at time $t + h$ with respect to the edge set at time $t$ ($h$ trading weeks). We observe that edges change considerably in the financial network, with almost 40% of edges changing after a period of 20 trading weeks. Fig. 2 panel B) shows the cross-similarity among financial networks measured as the Jaccard Distance between $E^F(t)$ and $E^F(t')$, with $t$ and $t'$ ranging from 09/05/2012 to 02/21/2017. We observe that the rate of edge change (persistence) is quite stable over time, i.e., the number of edges that change is similar throughout the period. Hence, results indicate that the constructed financial networks are time-variant across the entire period studied, with a stable rate of edge changes over time.
3.2 Prediction of Stock Market Structure
We used Eq. 6 to predict the financial network at a future time $t + h$ by using the information about the past structures of the financial and social networks at previous times up to $t$. Fig. 3 panel A) shows the performance obtained in the prediction of out-of-sample edges for $h$ trading steps ahead. We achieved an overall high out-of-sample performance in financial network link prediction, with performances in the range of 73% to 95% depending on time-lag and time-period. Prediction power improved with a smaller time lag.
We compared our results to those obtained using a benchmark model that assumes that correlation structure is time-invariant, i.e., $E^F(t + h) = E^F(t)$. The performance improvement against the benchmark is estimated as the relative difference between $P$ and $P_b$, where $P$ represents the performance of the proposed model and $P_b$ is the performance of the benchmark. From Fig. 3 panel B), we observe that the higher the time lag, the higher the performance improvement over the benchmark. Let us note that performance improvement over the naive benchmark reached values as high as 40% for a long-term prediction with a lag of 20 trading weeks.
Fig. 4 reports an aggregate overview of the previous results for the out-of-sample prediction in terms of the number of weeks ahead. We observe that as the lag increases, the prediction performance declines (panel A). However, the improvement over the naive benchmark increases (panel B).
In Appendix A.2, we report the results obtained by using an expanding window rather than a rolling window as a training set. We observe that expanding the training set does not necessarily lead to better performance. In fact, the rolling window analysis yielded better performance overall.
To verify whether the multiplex network provides additional information to that from the financial network only, we re-computed the same out-of-sample edge prediction by using the financial network only and compared this to the results from the full model that considers both the financial and social information layers. A comparison between the two models was performed by comparing their respective likelihoods. We have also disaggregated the prediction of the insertion of new edges and the prediction of edge deletions. We report the likelihood values and AUC performance obtained for the fit of each model in Table 1.
We observed that the model that includes both financial and social information fit the data better than the model that considers financial data only, particularly for the prediction of the insertion of new edges. The likelihood ratio increases with prediction lag, indicating that full models (i.e., those that consider both financial and social networks) are particularly important in long-term link prediction. Results confirm that the multiplex network is distinctly better than the single financial layer, with the likelihood ratios being statistically significant for all configurations tested.
3.3 Prediction of Social Opinion Structure
We have so far established that social opinion structure can provide statistically significant information about the future financial market structure. In this section, we investigate the opposite relationship of whether financial market structure can also significantly improve the prediction of future social opinion structure, and we determine if this effect is larger or smaller.
The comparison between performance results is summarized in Fig. 5, where the prediction of social opinion structure is plotted together with the results for the prediction of financial market structure that was discussed previously. Surprisingly, results suggest that financial market structure has a higher predictability than social opinion structure. We also observe that both the financial network and social opinion network predictions lead to an improvement compared to the naive benchmark that considers time invariance in social network structure. As previously observed, the relative performance improvement increases with time lag. In this case, the relative improvement in prediction is higher for the social opinion structure than for the financial network as observed in Fig. 5 panel B).
One of the possible reasons why social opinion structure is less predictable compared to financial network structure is the higher structural variability of the former compared to the latter. Fig. 6 provides evidence that social media structure is less stable than financial market structure in terms of the number of edge changes over time. More edges changed in the social opinion network than in the financial network for all lags tested. We observed that more than 50% of the edges in the social media opinion structure changed compared to 40% in the financial network over a time lag of 20 trading weeks.
4 Discussion and Conclusions
We investigated whether financial market structure can be better predicted by combining past financial information with past social media sentiment information. We considered $N = 100$ of the most capitalized companies that were part of the S&P500 index in the period between September 2012 and August 2017. We generated two networks: a financial network constructed from log-returns of equity prices and a social network constructed from Twitter sentiment analytics. We constructed filtered correlation-based networks by keeping only the strongest quartile of correlations, computed over a rolling window of 126 trading days. The two networks were treated as a multiplex problem with two layers of networks that share the same nodes (stocks) but have different edge sets.
The financial market structure forecasting problem was formulated as a link prediction problem where we estimated the probability of the addition or removal of a link in the future based on information about the past structure of financial and social opinion networks.
We proposed that financial network links were formed by a combination of the two mechanisms of triadic closure and edge persistence. The first mechanism assumes that two stocks have a propensity to be correlated if they share common neighbors. The edge persistence mechanism assumes that two connected stocks tend to remain connected in the future. A logistic model was trained over a set of data between 09/05/2012 and 09/10/2014, and results were then reported for the validation set over the following period from 09/17/2014 to 08/25/2017.
Our results indicate that financial market structure is considerably time variant, which invalidates the commonly used assumption of time invariance in the determination of stock correlation structure. The proposed model exhibited high out-of-sample performance in financial network link prediction, particularly in the case of long-term predictions where we observed a performance improvement of up to 40% over a naive benchmark that assumed that the correlation structure of the financial market was time invariant. Likelihood ratio analysis demonstrated that models that considered both financial and social information better fit the data when compared to a restricted model that considers financial information only. This provides evidence that supports the use of social information in the prediction of financial market structure.
Finally, our findings indicate that social opinion structure is less stable than financial market structure. Surprisingly, the prediction of financial market structure using past social and financial information presented higher performance compared to the problem of predicting social opinion structure using past social and financial information.
Let us note that network link formation can occur due to mechanisms beyond the ones we studied here. For instance, networks can form links as a result of a growth process that adds new nodes in the network, e.g., IPOs can generate growth in a financial network. Among other possible mechanisms, link formation can occur due to preferential attachment, a phenomenon widely observed in real networks where new nodes tend to link to the more connected ones [29].
In summary, this study indicates that social opinion structure is relevant to the prediction of future financial correlation structures. This result has important consequences because of the fundamental importance of financial correlation structure in Modern Portfolio Theory (MPT) [30], Capital Asset Pricing Model (CAPM) and Arbitrage Pricing Theory (APT) [31]. Future work should focus on the investigation of further mechanisms of financial link formation and on applications in portfolio allocation strategies.
5 Acknowledgments
This work was supported by PsychSignal.com, which provided social media data. T. Aste acknowledges support of the UK Economic and Social Research Council (ESRC) in funding the Systemic Risk Centre (ES/K002309/1). T.T.P. Souza acknowledges financial support from the Brazilian National Council for Scientific and Technological Development (CNPq).
Appendix
A.1 Ticker Codes of Selected Companies
AAPL, AMZN, NFLX, MSFT, GS, GOOGL, BAC, JPM, IBM, DIS, GILD, INTC, YHOO, WMT, GE, XOM, SBUX, CSCO, WFC, NVDA, PCLN, JNJ, MCD, NKE, BA, VZ, ES, PFE, KO, CVX, CAT, MU, MRK, CELG, EBAY, MS, CRM, FCX, QCOM, TGT, HD, CHK, BMY, AMGN, PG, HPQ, ORCL, FSLR, WFM, COST, BIIB, PEP, EA, AXP, WYNN, CMCSA, CL, AIG, DOW, NEM, MA, BBY, COP, LOW, TWX, ADBE, HAL, LLY, UNH, LUV, MMM, CVS, MO, FDX, DD, ED, KR, MON, UTX, ABT, SLB, YUM, MCO, AMAT, EXPE, AET, DE, GPS, UPS, VLO, CBS, HAS, COH, ALL, WDC, JWN, TXN, PM, UNP, EOG.
A.2 Prediction Results Using an Expanding Window Training Set
In this section, we report results using models that were trained in an expanding window, instead of a rolling window, using initial start and end dates of 09/05/2012 and 09/10/2014, respectively. The test period ranges from 09/17/2014 to 08/25/2017.
- [1] M. Tumminello, S. Miccichè, F. Lillo, J. Piilo, R. N. Mantegna, Statistically validated networks in bipartite complex systems, PLoS ONE 6 (3) (2011) 1–11. doi:10.1371/journal.pone.0017994. URL http://dx.doi.org/10.1371%2Fjournal.pone.0017994
- [2] R. N. Mantegna, Hierarchical structure in financial markets, The European Physical Journal B - Condensed Matter and Complex Systems 11 (1) (1999) 193–197. doi:10.1007/s100510050929. URL http://dx.doi.org/10.1007/s100510050929
- [3] T. Aste, W. Shaw, T. Di Matteo, Correlation structure and dynamics in volatile markets, New Journal of Physics 12 (8) (2010) 085009.
- [4] M. Tumminello, F. Lillo, R. N. Mantegna, Correlation, hierarchies, and networks in financial markets, Journal of Economic Behavior & Organization 75 (1) (2010) 40–58, Transdisciplinary Perspectives on Economic Complexity. doi:10.1016/j.jebo.2010.01.004. URL http://www.sciencedirect.com/science/article/pii/S0167268110000077
- [5] M. Tumminello, T. Aste, T. Di Matteo, R. N. Mantegna, A tool for filtering information in complex systems, Proceedings of the National Academy of Sciences of the United States of America 102 (30) (2005) 10421–10426. arXiv:http://www.pnas.org/content/102/30/10421.full.pdf, doi:10.1073/pnas.0500298102. URL http://www.pnas.org/content/102/30/10421.abstract
- [6] T. Aste, W. Shaw, T. D. Matteo, Correlation structure and dynamics in volatile markets, New Journal of Physics 12 (8) (2010) 085009. URL http://stacks.iop.org/1367-2630/12/i=8/a=085009
- [7] W.-M. Song, T. Aste, T. Di Matteo, Analysis on filtered correlation graph for information extraction, Statistical Mechanics of Molecular Biophysics (2008) 88.
- [8] F. Pozzi, T. Di Matteo, T. Aste, Spread of risk across financial markets: better to invest in the peripheries, Scientific Reports 3.
