Increasing trend of scientists to switch between topics
An Zeng, Zhesi Shen, Jianlin Zhou, Ying Fan, Zengru Di, Yougui Wang, H. Eugene Stanley, Shlomo Havlin

TL;DR
This paper investigates how scientists switch between research topics over their careers, revealing increased switching frequency over time and its complex relationship with productivity and citations.
Contribution
It introduces a novel analysis of topic switching dynamics using shared references and proposes a model explaining these behaviors.
Findings
Scientists tend to have a narrow range of research topics.
Topic switching has increased over the years.
High early-career switching correlates with lower productivity.
Abstract
We analyze the publication records of individual scientists, aiming to quantify the topic switching dynamics of scientists and its influence. For each scientist, the relations among her publications are characterized via shared references. We find that the co-citing network of the papers of a scientist exhibits a clear community structure where each major community represents a research topic. Our analysis suggests that scientists tend to have a narrow distribution of the number of topics. However, researchers nowadays switch more frequently between topics than those in the early days. We also find that high switching probability in early career (<12y) is associated with low overall productivity, while it is correlated with high overall productivity in latter career. Interestingly, the average citation per paper, however, is in all career stages negatively correlated with the switching…
An Zeng1, Zhesi Shen2, Jianlin Zhou1, Ying Fan1, Zengru Di1,
Yougui Wang1,∗, H. Eugene Stanley3,∗ and Shlomo Havlin4,∗
1School of Systems Science, Beijing Normal University, Beijing 100875, China
2National Science Library, Chinese Academy of Sciences, Beijing 100190, China
3Center for Polymer Studies and Department of Physics, Boston University, Boston, MA 02215
4Department of Physics, Bar-Ilan University, Ramat-Gan 52900, Israel
∗To whom correspondence should be addressed;
E-mail: [email protected] (Y.W.), [email protected] (H.E.S.), [email protected] (S.H.)
We analyze the publication records of individual scientists, aiming to quantify the topic switching dynamics of scientists and its influence. For each scientist, the relations among her publications are characterized via shared references. We find that the co-citing network of the papers of a scientist exhibits a clear community structure where each major community represents a research topic. Our analysis suggests that scientists tend to have a narrow distribution of the number of topics. However, researchers nowadays switch more frequently between topics than those in the early days. We also find that high switching probability in early career (< 12 years) is associated with low overall productivity, while it is correlated with high overall productivity in later career. Interestingly, the average citation per paper is negatively correlated with the switching probability in all career stages. We propose a model with exploitation and exploration mechanisms that can explain the main observed features.
Introduction
Uncovering the mechanisms governing research activities of individual scientists and their evolution with time is critical for understanding and managing a wide range of issues in science, from training of scientists to collective discovery of new knowledge (?, ?, ?, ?, ?). The digital publishing era has led to a revolution in science embodied in big data that captures major activities in research. This creates an unprecedented opportunity to explore the dynamical patterns of scientific production and reward using state-of-the-art mathematical and computational tools (?, ?, ?). Apart from the early works aiming at evaluating scientific impact with scientists’ citations (?), the h-index (?) and related variants (?), there is a recent wave of studies focusing on quantifying and modeling the evolution of research creativity throughout scientists’ careers (?, ?, ?, ?, ?, ?, ?, ?). Scientists’ cumulative production has been shown to exhibit persistent growth with time (?), which is associated with the well-known Matthew effect (?). By associating each publication with its citations, it has been revealed that the most influential work of a scientist appears randomly within the sequence of her publications (?). A follow-up work investigated the timing of the several most influential papers of an individual researcher, revealing that a scientist’s career may involve a hot streak period during which her performance is substantially higher than her typical performance (?). Other issues such as the evolution of scientists’ creativity (?), reputation (?), social ties (?) and mobility (?, ?) over their careers have also been investigated.
A fundamental driving force of scientific research is the evolution of scientists’ research interest (?), which is reflected in the switching of scientists between different research topics over time. Sociologists of science have made persistent efforts to understand qualitatively the principles governing the topic selection of scientists, and pointed out that it may result from a trade-off between conservative production and risky innovation (?). Sociologists have also proposed rich illustrative models to categorize the research strategies adopted by scientists (?). With the increasing availability of scientific publication data, the issue of topic selection has started to be analyzed quantitatively in recent years. It has been pointed out that the research interest of individual physicists could shift significantly from the beginning to the end of the career, with the distance between interests being measured based on field classification codes in physics (?). However, the variation of topic switching during the individual career has not been studied so far. Here we ask: How can we identify the topics in which an individual scientist is involved? How frequently does a scientist switch between different research topics? Does more frequent switching between topics help a scientist’s impact? Has the topic switching behavior of scientists changed during the past century?
To address these questions, we construct a network for each scientist characterizing the relations between her papers. The structure of this network reveals how an individual scientist’s research interests are embodied. This framework allows us, by applying community analysis, to specify the various research interests and accordingly investigate the detailed dynamics of the research interest shifting of a scientist, as well as the evolution of the switching tendency during the last century and its relation to research impact. Our analysis suggests that scientists tend to have a narrow distribution of the number of major topics during their lifetime. We find that the typical number of major topics stayed almost unchanged during the last century. However, researchers in the early days tended to work on a topic for a longer time before switching to another topic, while nowadays they tend to work on multiple topics simultaneously. Interestingly, we find that more frequent switching between topics in the early career (< 12 years) is related to lower research performance, i.e., both the overall productivity and mean citation are lower. In marked contrast, more frequent switching in the later career is associated with higher overall productivity but with lower mean citations. We propose a model reproducing the main observed empirical patterns. Our framework, although applied here to physicists and computer scientists, is general and does not rely on the availability of field classification codes, so it can be applied to analyzing scientists from any discipline.
Results
In this paper, we analyze the scientific publication data of the American Physical Society (APS) journals. Disambiguated author name data provided in (?) is used to assign each paper to its authors, which results in the publication records of 236,884 distinct scientists (for basic statistics of this data see Fig. S1 of the Supplementary Materials (SM)). In order to investigate how the papers of an individual scientist are related, we construct for each scientist a co-citing network (CCN) in which each node is a paper authored by this scientist and two papers have a link if they share at least one reference. This approach of constructing links between nodes (papers) based on their common neighbors is called bibliographic coupling in scientometrics (?, ?) and has also been widely used in the analysis of various other real systems such as international trading systems (?) and online social systems (?). The communities of each scientist’s co-citing network are identified with the fast unfolding algorithm, which detects communities by maximizing the modularity function (?). Typically, a network contains several large communities as well as some small clusters and isolated nodes. The major communities represent the main research topics of this scientist. As the network size needs to be large enough to ensure meaningful community detection results, we consider in this study all scientists who have published at least 50 papers in the APS journals (3,420 scientists in total; for the distribution of their career start years see Fig. S2). Results for scientists with fewer papers (at least 20 papers, 15,373 scientists) are similar and are reported in Figs. S10 and S11 in SM. In addition, we have studied the communities detected in the weighted co-citing network where links are weighted according to the number of shared references. The community structure is not significantly altered when considering the link weights (see Fig. S3), as large weights tend to be located on the links within communities. Our community analysis has also been examined based on a modified modularity function with a higher resolution parameter (see Figs. S12 and S13 in SM) and on another data set from computer science (see Figs. S15 and S16 in SM); for all tests, the main conclusions are similar.
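The construction of a co-citing network described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the input format (a dictionary mapping a paper identifier to the set of works it cites), the function names `build_cocite_network` and `giant_component_fraction`, and the toy data are all hypothetical and not taken from the paper. Community detection itself is not shown here.

```python
from itertools import combinations


def build_cocite_network(references):
    """Link two papers whenever they share at least one reference.

    `references` maps a paper id to the set of works it cites
    (a hypothetical input format chosen for this sketch).
    Returns the adjacency sets and the number of shared references per link.
    """
    adjacency = {paper: set() for paper in references}
    weights = {}
    for a, b in combinations(sorted(references), 2):
        shared = references[a] & references[b]
        if shared:
            adjacency[a].add(b)
            adjacency[b].add(a)
            weights[(a, b)] = len(shared)  # link weight = number of shared references
    return adjacency, weights


def giant_component_fraction(adjacency):
    """Relative size of the largest connected component of the network."""
    seen = set()
    largest = 0
    for start in adjacency:
        if start in seen:
            continue
        stack = [start]
        seen.add(start)
        size = 0
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            size += 1
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        largest = max(largest, size)
    return largest / len(adjacency) if adjacency else 0.0


# Toy publication record of one scientist (made-up identifiers).
toy_references = {
    "p1": {"r1", "r2"},
    "p2": {"r2", "r3"},
    "p3": {"r3"},
    "p4": {"r9"},
}
adjacency, weights = build_cocite_network(toy_references)
```

In this toy record, p1, p2 and p3 form a chain through shared references while p4 stays isolated, which mirrors the mix of connected clusters and isolated nodes mentioned in the text.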
An illustration of the co-citing network of a typical highly-cited scientist is given in Fig. 1. The community connectivity matrix in Fig. 1c shows that nodes within each community are well connected, whereas nodes in different communities are much less connected. The time series presented in Fig. 1d describes the growth history of the network and reveals how this scientist moves from one research topic to another during his career. In the time series, each point is a paper and different colors represent different communities in the co-citing network. The height of the point is the number of links (i.e. degree) that the paper has in the network. The analyses in our study are mainly based on the co-citing networks and time series of scientists.
We first focus on the structural properties of the co-citing networks (CCNs). For each scientist’s CCN, we calculate the size of its giant component (GC) and study its correlation with the network size, as shown in the scatter plot presented in Fig. 2a. Most of the points are located close to the diagonal line, indicating that CCNs are in general well connected and have relatively large GCs (see Fig. S4 in SM for the results with the network including also the co-cited relations between papers). This is also seen in the inset of Fig. 2a, where the distribution of the relative size of the GC is strongly skewed and concentrated close to 1. Fig. 1c suggests that a CCN has a community structure. As statistical support for this phenomenon, we plot in Fig. 2b the maximized modularity, $Q_{\mathrm{real}}$, in real CCNs and the maximized modularity, $Q_{\mathrm{rand}}$, in their degree-preserved reshuffled counterparts. All points are located under the diagonal line, indicating that the community structure in real CCNs is truly significant.
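A degree-preserved reshuffled counterpart of a network is commonly generated by repeated double-edge swaps. The sketch below illustrates that general technique; whether the paper used exactly this procedure is not stated, so the function `degree_preserving_shuffle`, its parameters and the toy ring network are assumptions made for illustration.

```python
import random


def degree_preserving_shuffle(edges, n_swaps, rng):
    """Randomize an undirected simple graph while keeping every node degree.

    Repeatedly picks two links (a, b) and (c, d) and rewires them to
    (a, d) and (c, b), rejecting swaps that would create a self-loop
    or a duplicate link.
    """
    edges = [tuple(edge) for edge in edges]
    present = {frozenset(edge) for edge in edges}
    done = 0
    attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if len({a, b, c, d}) < 4:
            continue  # the two links share a node; a swap could create a self-loop
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # the swap would duplicate an existing link
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i] = (a, d)
        edges[j] = (c, b)
        done += 1
    return edges


# Toy example: a ring of 8 nodes, reshuffled with a fixed seed.
ring_edges = [(k, (k + 1) % 8) for k in range(8)]
shuffled_edges = degree_preserving_shuffle(ring_edges, n_swaps=20, rng=random.Random(1))
```

Comparing the maximized modularity of the original and the reshuffled network then indicates whether the observed community structure exceeds what the degree sequence alone would produce.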
Given that papers tend to cluster into communities in a CCN, one interesting question is what the typical number of communities of a scientist is. We show in Fig. 2c the distribution of the number of communities for all scientists. The number of communities is seemingly broadly distributed. However, as CCNs may contain isolated nodes or very small clusters, we use a threshold to eliminate communities that are too small to be regarded as a research field of a researcher. After filtering, the distribution of the number of communities per scientist becomes very narrow, with peaks around 4 and 3 when only communities with sizes larger than 2 and 5, respectively, are considered. In the following analysis, we define major communities as those with more than two nodes. To better understand the community size in CCNs, we show in Fig. 2d the fraction of papers in each community sorted by size in descending order. The strong decay of the curve indicates that a few major communities comprise most of the nodes. A further investigation of the inverse cumulative probability of the fraction of nodes in the several largest communities indicates that for half of the scientists, the three largest communities include over 70% of their papers, as seen in Fig. 2e.
In each CCN, a major community contains papers that are topologically close to each other. In order to validate whether the papers in a community are indeed on similar research topics (?, ?), we analyze the PACS codes (a field classification code in physics) of the papers belonging to the same community. We show in Fig. 2f the mean Gini coefficient (?) of the distribution of PACS codes in each community, with communities sorted by size in descending order. A larger Gini coefficient corresponds to a more heterogeneous distribution of the PACS codes in a community. The real data is compared with a random counterpart where the PACS codes are reshuffled among each individual scientist’s papers while the community structure is preserved. We find that the mean Gini coefficient in real data is higher than that in the random counterpart, with a p-value smaller than 0.01 in the Kolmogorov-Smirnov test of the corresponding Gini coefficient distributions. Thus, our results suggest that papers in a community tend to share the same PACS codes and that the detected communities reflect distinct research fields of a scientist.
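As a rough illustration of the heterogeneity measure used above, the following sketch computes a Gini coefficient over the counts of PACS codes within one community. The text does not specify over which set of codes the distribution is taken, so the choice here (all codes appearing in the scientist's papers, passed as `all_codes`) is an assumption, as are the function names and the example codes.

```python
from collections import Counter


def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly even)."""
    values = sorted(values)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-based formula for sorted data:
    # G = sum_i (2i - n - 1) * x_i / (n * sum_i x_i), with i starting at 1.
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(values, start=1))
    return weighted / (n * total)


def pacs_gini(community_codes, all_codes):
    """Gini coefficient of how often each code in `all_codes` occurs in one community.

    `community_codes` lists the codes attached to the community's papers;
    `all_codes` is the (assumed) reference set of codes of the scientist.
    """
    counts = Counter(community_codes)
    return gini([counts.get(code, 0) for code in all_codes])


# Made-up example: one code dominates the community, so the Gini coefficient is high.
example = pacs_gini(["05.45", "05.45", "05.45", "89.75"], ["05.45", "89.75", "64.60"])
```

A community whose papers concentrate on a few codes yields a high value, while codes spread evenly across the community yield a value near zero.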
Once the detected communities are marked in the time series (Fig. 1d), the dynamics of scientists’ interest across different research topics can be investigated. To this end, we first show in Fig. 3a the mean number of yearly involved major communities for each scientist. Scientists tend to be involved in a small number of communities during their early career. The number of yearly involved communities then increases until it peaks around the 12th year of the career, and gradually decreases after that. However, when a scientist publishes more papers in a year, she might have a higher number of yearly involved communities purely by chance. To remove the effect of the number of yearly published papers (see Fig. S5 in SM), we propose another metric called switching probability, which measures the probability that a scientist switches from one major community to another major community between two adjacent publications. Fig. 3b shows the evolution of the mean switching probability in different career years. The peak of the switching probability is also around the 12th career year, indicating that scientists tend to switch less during their early career and more in the later stage of their career, which is consistent with the trend observed for the yearly involved communities.
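The switching probability defined above can be sketched as the fraction of adjacent publication pairs whose papers belong to different major communities. How papers outside major communities are handled is not specified in the text; the sketch below simply drops them before forming adjacent pairs, which is an assumption, as are the function name and the toy labels.

```python
def switching_probability(labels, major):
    """Fraction of adjacent publications that move between different major communities.

    `labels` is the community label of each paper in publication order;
    `major` is the set of labels regarded as major communities.
    """
    # Assumption of this sketch: papers outside major communities are skipped.
    sequence = [label for label in labels if label in major]
    pairs = list(zip(sequence, sequence[1:]))
    if not pairs:
        return 0.0
    switches = sum(1 for previous, current in pairs if previous != current)
    return switches / len(pairs)


# Toy career: "x" marks a paper in a small, non-major cluster.
toy_labels = ["A", "A", "B", "x", "B", "A", "C"]
toy_switching = switching_probability(toy_labels, major={"A", "B", "C"})
```

Because the measure is normalized by the number of adjacent pairs, it does not grow automatically with the number of papers published in a year, which is the property the text asks for.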
We further ask whether more frequent switching helps research performance. To this end, we investigate the correlation between the switching rate and research performance. Here, we measure the research performance of a scientist using two almost uncorrelated metrics (see Fig. S8), i.e., the number of published papers and the mean citation per paper. Consistent with ref. (?), we only consider the number of citations 10 years after a paper is published, i.e., $c_{10}$. We first compare in Fig. 3c the overall switching probability with the switching probability of the 10% most productive scientists in different career years. Surprisingly, we find two opposite behaviors. In the early career stage (< 12 years), high overall productivity is associated with low switching probability, yet in the later career stage high productivity is associated with higher switching probability. In addition, we compare in Fig. 3d the overall switching probability with the switching probability of the 10% of scientists who have the highest mean citation per paper. The figure shows that high average citation per paper is associated with low switching probability in all career periods. This interesting finding might be due to the fact that higher switching probability reduces the impression of leadership in a specific field, yielding fewer citations. This result is also supported by an additional test where the switching probability is found to be negatively correlated with mean citation per paper, especially for productive scientists (see Fig. S9 in SM). To examine the significance of these findings, we carry out the Kolmogorov-Smirnov test of the switching probability distribution in each career year. The small p-values shown in the insets of Figs. 3c and 3d suggest that the overall (total population) switching probability indeed follows a distinct distribution from each of the two sub-groups of scientists (i.e. the 10% most productive and the 10% most highly cited per paper) in each career year. We additionally calculate the Pearson correlation between scientists’ switching probability in different career years and their overall productivity, as well as the Pearson correlation between scientists’ switching probability in different career years and the mean citation per paper. The correlations presented in Figs. S6a and S6b also support the findings revealed in Figs. 3c and 3d.
Next we study how the structural and dynamical properties of CCNs evolved with the development of science over the last 100 years. As our data ends in 2010, the careers of some scientists are not completed. We thus have to fix the career length of the scientists from different years in order to ensure a fair comparison between their CCNs. Specifically, we only consider a fixed number of each scientist’s first career years and remove (i) all the scientists whose careers have not yet reached this length and (ii) those who published fewer than 30 papers within it. The results presented here are for a single fixed career length. We first select the scientists who started their careers in a certain year and average the number of major communities in which these scientists have been involved in their careers. We show in Fig. 4a the mean number of communities for the scientists who started their career in different years. The results indicate that as science evolves, the number of major communities of individual scientists stays almost unchanged. The evolution of other structural properties of CCNs is presented in Fig. S7. We further calculate the mean switching probability of each scientist over her career and accordingly compute the mean switching probability per year by averaging the switching probability of all scientists who started their career in this year. We show in Fig. 4b the average switching probability of scientists who started their career in different years. Surprisingly, the results indicate that although the number of communities is stable over the years, switching between communities, i.e., topics, became more frequent during the last century. More specifically, scientists in the earlier days tended to work on a topic for a longer period before switching to another topic. In contrast, scientists nowadays tend to work on multiple topics almost simultaneously, resulting in switches between communities in almost every pair of adjacent publications.
We then test the significance of our observed trends by directly studying the distributions of the number of communities and the switching probability for two groups of scientists. The first group includes the scientists who started their careers between 1950 and 1960, while the second group contains the scientists who started their careers between 1970 and 1980. One can see in Fig. 4c that the distributions of the number of communities for these two groups of scientists largely overlap. The distributions of the switching probability for these two groups of scientists in Fig. 4d, however, exhibit a significant difference.
We finally propose a model that could help to understand the main mechanisms leading to the observed patterns of scientists’ research dynamics. The research activities of scientists can be modeled as a discovery process in the knowledge space (i.e. a network characterizing the connections among different pieces of knowledge) (?, ?). When a scientist publishes a paper, she activates a node (i.e. a new piece of knowledge) in the knowledge space. The sub-network activated by this scientist during her career forms a personal network recording all her papers as well as the links, i.e., relations between them. The simplest model for the node activation process is the standard random walk, assuming that a scientist randomly activates a neighboring node of the previously activated node. Here, we propose an Exploitation-Exploration model (EEM) by introducing an exploitation process (controlled by a probability $p$) and an exploration process (controlled by a probability $q$) to the random walk model. Both processes have been pointed out to be fundamental for innovation in various adaptive systems (?). In our model, these two processes are performed sequentially. Instead of always starting from the last activated node in each step, the scientist has probability $p$ to randomly restart from (re-exploit) one of the previously activated nodes. Once the re-exploited node is determined, the scientist has probability $q$ to explore nodes beyond the nearest neighbors. For simplicity, we assume that the scientist randomly activates a next-nearest neighbor in the exploration step. Note that the EEM reduces to the standard random walk model when $p = 0$ and $q = 0$. For an illustrative demonstration of the random walk model and the EEM, see Fig. 5a. In our simulation, the knowledge space is represented as a network consisting of all the APS papers, with any two nodes (papers) linked if they share at least one reference. The first activated node for each scientist is set to be her first paper. The rest of the papers of each scientist are generated by following the EEM on the APS network until the number of activated nodes equals the real number of papers of each scientist.
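The activation process of the Exploitation-Exploration model can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the symbols `p` (restart probability) and `q` (exploration probability), the rule that already activated nodes are not activated again, the `max_steps` guard and the toy ring network are all assumptions of this sketch.

```python
import random


def simulate_eem(graph, start, n_papers, p, q, rng, max_steps=10_000):
    """Activate nodes of `graph` (node -> set of neighbors) in an EEM-like walk.

    With probability p the walk restarts from a random previously activated
    node (exploitation); otherwise it continues from the last activated node.
    With probability q it then activates a next-nearest neighbor of that node
    (exploration); otherwise it activates a nearest neighbor.
    Setting p = 0 and q = 0 gives a plain random walk over new nodes.
    """
    activated = [start]
    seen = {start}
    for _ in range(max_steps):
        if len(activated) >= n_papers:
            break
        base = rng.choice(activated) if rng.random() < p else activated[-1]
        neighbors = graph[base]
        if rng.random() < q:
            # Nodes exactly two steps away from the base node.
            candidates = {nn for nb in neighbors for nn in graph[nb]} - neighbors - {base}
        else:
            candidates = set(neighbors)
        candidates -= seen  # assumption: each node is activated at most once
        if not candidates:
            continue  # nothing new reachable from this base node; try another step
        choice = rng.choice(sorted(candidates))
        activated.append(choice)
        seen.add(choice)
    return activated


# Toy knowledge space: a ring of 30 nodes (made up for illustration).
ring = {k: {(k - 1) % 30, (k + 1) % 30} for k in range(30)}
walk = simulate_eem(ring, start=0, n_papers=5, p=0.0, q=0.0, rng=random.Random(0))
career = simulate_eem(ring, start=0, n_papers=10, p=0.6, q=0.2, rng=random.Random(0))
```

With both probabilities at zero the walk on the ring produces a single chain of adjacent nodes, while positive values let it return to earlier regions and jump past direct neighbors. The parameter values in the second call are arbitrary illustrative choices.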
We first test the EEM by simulating the research dynamics of the representative highly-cited scientist presented in Fig. 1. Specifically, we compare in Fig. 5b the co-citing network (CCN) as well as the time series of published papers generated by both the standard random walk model and the EEM. The initial paper and the number of papers in each year of this simulated scientist are set to be the same as in the real data. One can immediately see that the network generated by the standard random walk model is very different from the typical real one in Fig. 1b, as it contains many long chains and lacks distinct communities. Moreover, the time series obtained from the random walk model is also very different from that of the typical real researcher shown in Fig. 1d, in the sense that no switching between communities can be observed in each year. In contrast, both the network and the time series generated by the EEM qualitatively reproduce the properties exhibited in Fig. 1. We further support the EEM quantitatively by examining some statistical quantities generated by this model. The first is the number of yearly involved communities under different values of the exploitation probability $p$, as presented in Fig. 5c. When $p = 0$, each scientist roughly works in only one community each year. As $p$ increases, the number of yearly involved communities becomes larger and, for a suitable value of $p$, peaks around 1.8, which is the value observed in real data. We have tested and found that the exploration probability $q$ has little effect on the yearly involved communities, thus it is held fixed in Fig. 5c. Another statistical quantity is the number of communities in which each scientist is involved during her research career. When $q = 0$, the generated sub-network does not have distinct communities and thus the number of communities is very narrowly distributed (even for the case where all detected clusters are regarded as communities), as shown in Fig. 5d. As $q$ increases, small communities start to emerge, resulting in the separation of the distributions obtained with the three community-size thresholds (all communities, sizes larger than 2, and sizes larger than 5). For a suitable value of $q$, these three distributions respectively peak around 11, 8 and 5, similar to that in real data, see Fig. 3a. We have also found that $p$ has little effect on the distribution of the number of communities, thus $p$ is held fixed in Fig. 5d.
We finally estimate the probabilities $p$ and $q$ for each scientist based on real data. Consider the sequence of papers published by a scientist. When each of these papers is published, if it shares no reference with any of the scientist’s papers published before, we keep a record of this paper and finally count the total number of such papers. The exploration probability $q$ can then be easily estimated from the fraction of such papers among all papers of the scientist. Similarly, in the sequence of the scientist’s papers, if a paper shares at least one reference with the preceding paper published by the scientist, we keep a record of this paper and finally count the total number of such papers. In this way, we can estimate the exploitation probability $p$ from the fraction of such papers. The distributions of the estimated $p$ and $q$ from real data are shown in Figs. 5e and 5f. One can see that the distributions of $p$ and $q$ peak around the same values as those used in Figs. 5c and 5d, which generate statistical properties consistent with real data.
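The two counts described above can be obtained from a scientist's chronologically ordered reference lists as sketched below. The exact estimators are not recoverable from the text, so the sketch only returns the two fractions. Normalizing by the number of papers after the first one and the function name `estimate_fractions` are assumptions, and mapping the second fraction to the restart probability (for instance as one minus that fraction) would be a further assumption.

```python
def estimate_fractions(reference_sets):
    """Return (fraction_new, fraction_follow) for one scientist.

    `reference_sets` is a list of sets of cited works, in publication order.
    fraction_new:    share of papers with no reference in common with ANY earlier paper.
    fraction_follow: share of papers with a reference in common with the PRECEDING paper.
    Both are normalized by the number of papers after the first one (assumption).
    """
    n_new = 0
    n_follow = 0
    earlier = set()  # union of all references cited so far
    for index, refs in enumerate(reference_sets):
        if index > 0:
            if not (refs & earlier):
                n_new += 1
            if refs & reference_sets[index - 1]:
                n_follow += 1
        earlier |= refs
    steps = len(reference_sets) - 1
    if steps <= 0:
        return 0.0, 0.0
    return n_new / steps, n_follow / steps


# Made-up record of five papers.
toy_record = [{"a", "b"}, {"b", "c"}, {"x"}, {"a"}, {"x", "y"}]
fraction_new, fraction_follow = estimate_fractions(toy_record)
```

In the toy record only the third paper cites nothing seen before, and only the second paper shares a reference with its immediate predecessor, so both fractions equal one quarter.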
Discussion
To summarize, we study the research dynamics of scientists by constructing a network of each individual scientist’s publications characterizing their co-citing relations. We find that typically each network has a clear community structure. The papers in a community tend to share the same PACS code, indicating that each community indeed represents a research area. By filtering out the small communities of fewer than 3 nodes, we obtain the major communities of scientists. We find that the numbers of major communities of scientists during their careers are narrowly distributed. In addition, for half of the scientists, the largest three communities already comprise over 70% of their papers. We compare the statistical properties of the co-citing networks of scientists who started their career in different years. We find that although the total number of communities stays almost unchanged, switching between communities has become more frequent over the years. In addition, we find that high average citation per paper correlates with low switching probability in all career stages. In marked contrast, high switching probability in early career correlates with low overall productivity, while high switching probability in later career is associated with high overall productivity. Finally, we propose a model capturing the main features of the research dynamics of individual scientists. The research activity is modeled as a node activation process in a knowledge network where nodes represent all the papers in APS and links represent co-reference relations between nodes. The model reproduces the main structural and dynamical patterns of an individual scientist’s publishing behavior by assuming that the scientist activates nodes in the network based on a random walk process which includes the exploitation and exploration mechanisms.
Our work provides a general framework for incorporating network tools into the temporal analysis of publication records of individuals. Several promising extensions can be built on this work. A straightforward one would be constructing networks of papers for departments or institutions, which would help us to estimate the cooperative behavior within a department. The higher-level research dynamics of these departments or institutions might be fundamentally different from the research dynamics at the level of individual scientists, and studying them would substantially deepen our understanding of how research activities are collectively organized. Similarly, one can investigate the networks characterizing relations among the papers published under the support of cooperative or individual research grants. The outcome of a research grant could thus be evaluated based not only on the number of papers but also on the actual research directions and the cooperation between scientists.
Materials and methods
Data. In this paper, we analyze the publication data from all journals of the American Physical Society (APS). The data contains 482,566 papers, ranging from year 1893 to year 2010. For the sake of author name disambiguation, we use the author name dataset provided by Sinatra et al., which was obtained with a comprehensive disambiguation process in the APS data (?). Eventually, a total number of 236,884 distinct authors are matched. We found and analyzed 3,420 authors with at least 50 papers, and 15,373 authors with at least 20 papers. Another set of data that we analyzed in the supplementary materials is the computer science data obtained by extracting scientists’ profiles from online Web databases (?). The data contains 1,712,433 authors and 2,092,356 papers, ranging from year 1948 to year 2014. The author names in this data are already disambiguated. We found and analyzed 9,818 authors in this data with at least 50 papers.
Community detection. The co-citing network of a scientist is constructed by linking two papers if they share at least one reference. For simplicity, we do not weight the links and only consider the topology of the network. The community structure of the network is detected with the fast unfolding algorithm (?) which is a heuristic method based on modularity optimization. The modularity function considered in this paper is defined as
$$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \gamma \frac{k_i k_j}{2m} \right] \delta(c_i, c_j),$$
where $A_{ij}$ is the adjacency matrix of the network, $k_i$ is the degree of node $i$, $m$ is the total number of links in the network, $c_i$ is the community to which node $i$ is assigned, and the function $\delta(c_i, c_j)$ is 1 if $c_i = c_j$, and 0 otherwise. The communities are obtained when the function $Q$ is maximized. Note that $\gamma$ is a resolution parameter introduced in (?, ?), with $\gamma = 1$ in the standard modularity function (?). A larger $\gamma$ results in detecting more but smaller communities, while a smaller $\gamma$ yields fewer but larger communities. Results with a larger $\gamma$ are presented in the supplementary materials. Although the distribution of the number of communities is influenced by the parameter $\gamma$ (see Fig. S12), the dynamical properties are shown to be almost independent of the resolution of communities (see Fig. S13). For this reason, we consider the standard modularity function, i.e. $\gamma = 1$, in this paper.
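For a given assignment of nodes to communities, a modularity function with a resolution parameter can be evaluated directly, as in the sketch below. It assumes the widely used form in which the resolution parameter multiplies the degree-product term; the function name `modularity`, the argument `gamma` and the toy network are assumptions of this sketch. It only scores a partition and does not perform the fast unfolding optimization itself.

```python
def modularity(adjacency, community, gamma=1.0):
    """Modularity of a partition of an undirected, unweighted network.

    `adjacency` maps each node to the set of its neighbors;
    `community` maps each node to its community label;
    `gamma` is the resolution parameter (1.0 gives the standard modularity).
    """
    two_m = sum(len(neighbors) for neighbors in adjacency.values())  # 2 * number of links
    if two_m == 0:
        return 0.0
    total = 0.0
    for i in adjacency:
        for j in adjacency:
            if community[i] != community[j]:
                continue  # only pairs in the same community contribute
            a_ij = 1.0 if j in adjacency[i] else 0.0
            total += a_ij - gamma * len(adjacency[i]) * len(adjacency[j]) / two_m
    return total / two_m


# Toy network: two triangles joined by a single link.
toy_adjacency = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {node: "A" for node in toy_adjacency}
q_split = modularity(toy_adjacency, two_groups)
q_single = modularity(toy_adjacency, one_group)
```

Splitting the toy network into its two triangles gives a modularity of 5/14, whereas placing all nodes in one community gives zero, so a modularity-maximizing method would prefer the split.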
Acknowledgments
We thank Junming Huang and Louis Shekhtman for useful discussions. This work is supported by the National Natural Science Foundation of China (Grant Nos. 61603046, 61773069, 71731002 and 61573065) and the Natural Science Foundation of Beijing (Grant No. L160008). ZS is supported by China Postdoctoral Science Foundation under Grant 2017M620944. HES acknowledges the support from NSF Grants PHY-1505000, CMMI-1125290, and CHE-1213217, and DTRA Grant HDTRA1-14-1-0017. SH acknowledges the Israel-Italian collaborative project NECST, the Israel Science Foundation, U.S. Army Research Office contract number W911NF1810396, ONR, the Israeli Most and Japan Science Foundation, BSF-NSF, and DTRA (Grant No. HDTRA-1-10-1-0014) for financial support.
Author contributions
AZ, YW, HES and SH designed the research; AZ, ZS and JZ performed the experiments; AZ, YW and SH analyzed the data; all authors wrote the manuscript.
Competing financial interests
The authors declare no competing financial interests.
Data and materials availability
The data used in this paper are all publicly accessible. The APS data can be downloaded via https://journals.aps.org/datasets, and the computer science data can be downloaded via https://www.aminer.cn/aminernetwork.
Supplementary materials
Figs. S1 to S16
References
1. A. Zeng, et al. The science of science: from the perspective of complex systems, Phys. Rep. 714-715, 1 (2017).
2. M. Qi, et al. Standing on the shoulders of giants: the effect of outstanding scientists on young collaborators’ careers, Scientometrics 111, 1839 (2017).
3. T. Amjad, et al. Standing on the shoulders of giants, J. Informetr. 11, 307 (2017).
4. A. Rzhetsky, J. G. Foster, I. T. Foster and J. A. Evans, Choosing experiments to accelerate collective discovery, Proc. Natl. Acad. Sci. USA 112, 14569 (2015).
5. M. De Domenico, E. Omodei and A. Arenas, Quantifying the diaspora of knowledge in the last century, Appl. Netw. Sci. 1, 15 (2016).
6. A. Clauset, D. B. Larremore and R. Sinatra, Data-driven predictions in the science of science, Science 355, 477 (2017).
7. S. Fortunato, et al. Science of science, Science 359, eaao0185 (2018).
8. T. Kuhn, M. Perc and D. Helbing, Inheritance patterns in citation networks reveal scientific memes, Phys. Rev. X 4, 041036 (2014).
