Tracing Networks of Knowledge in the Digital Age

Mirco Musolesi

arXiv:1703.01476·cs.CY·August 9, 2019

TL;DR

This paper discusses how digital traces from social media, repositories, and mobile data enable mapping and understanding the global spread of knowledge, highlighting challenges and available analytical tools.

Contribution

It provides an overview of methods and tools for analyzing digital traces to study knowledge dissemination at large scales.

Findings

01

Digital traces reveal patterns of knowledge spread.

02

Mapping knowledge networks faces data and methodological challenges.

03

Tools exist for analyzing large-scale digital knowledge flows.

Abstract

The emergence of new digital technologies has allowed the study of human behaviour at a scale and at a level of granularity that were unthinkable just a decade ago. In particular, by analysing the digital traces left by people interacting in the online and offline worlds, we are able to trace the spreading of knowledge and ideas at both local and global scales. In this article we will discuss how these digital traces can be used to map knowledge across the world, outlining both the limitations and the challenges in performing this type of analysis. We will focus on data collected from social media platforms, large-scale digital repositories and mobile data. Finally, we will provide an overview of the tools that are available to scholars and practitioners for understanding these processes using these emerging forms of data.

Taxonomy

Topics: Complex Network Analysis Techniques · Human Mobility and Location-Based Analysis · Opinion Dynamics and Social Influence

Full text

Tracing Networks of Knowledge in the Digital Age

Mirco Musolesi

University College London and The Alan Turing Institute

Abstract

The emergence of new digital technologies has allowed the study of human behaviour at a scale and at a level of granularity that were unthinkable just a decade ago. In particular, by analysing the digital traces left by people in their online and offline lives, we are able to trace the spreading of knowledge and ideas at both local and global scales.

In this article we will discuss how these digital traces can be used to map knowledge in online and offline worlds, outlining both the limitations and the challenges in performing this type of analysis. We will focus on data collected from social media platforms, large-scale digital repositories and mobile data. Finally, we will provide an overview of the tools that are available to scholars and practitioners for understanding these processes using these emerging forms of data.

Keywords: knowledge networks, information dissemination, social media, spatial networks, network science.

1 Overview

Thanks to existing and emerging digital technologies, it is now possible to trace networks of knowledge at a scale and at a level of granularity that were unimaginable just a few years ago. In this article we will discuss a series of studies concerning the analysis and mapping of networks of knowledge at different scales through a variety of sources of digital data. We will focus in particular on data from social media, large-scale digital repositories and mobile data.

For instance, networks of knowledge can be extracted by analysing the interactions that take place on social network platforms. In particular, models of information diffusion can be defined to represent and understand the spreading of knowledge among groups of users in time and space (Nekovee et al., 2007; Liben-Nowell & Kleinberg, 2008; Lazer et al., 2009; Lerman & Ghosh, 2010; Kitsak et al., 2010; Dodds et al., 2011; Karsai et al., 2011; Pastor-Satorras et al., 2015). A typical approach is to describe the dissemination of information using epidemic models, in which what spreads is not a disease (Anderson & May, 1991) but a piece of information. In this way, knowledge dissemination over networks can be represented as an epidemic process and formalised by means of mathematical models (Keeling & Eames, 2005). One of the most fascinating aspects of tracing networks of knowledge in the digital world is that data can be extracted in digital form and used to develop and test models of dissemination at scale. Indeed, this is one of the most compelling examples of applications of big data outside the commercial world (Musolesi, 2014).
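The epidemic analogy can be made concrete with a minimal sketch. The code below is a generic discrete-time SIR-style simulation, not the specific model of any of the works cited above; the function name `simulate_sir`, the parameters `beta` and `gamma`, and the toy network are all illustrative assumptions. Users are 'susceptible' (have not heard the information), 'infected' (are actively spreading it) or 'recovered' (have stopped spreading it).

```python
import random


def simulate_sir(adj, seeds, beta, gamma, steps, seed=0):
    """Simulate SIR-style spreading of a piece of information.

    adj:   dict mapping each user to a list of neighbours (hypothetical input)
    seeds: users who know the information at step 0
    beta:  probability that a spreader informs a given neighbour in one step
    gamma: probability that a spreader stops spreading in one step
    Returns a list with the S/I/R counts at the start of each step.
    """
    rng = random.Random(seed)
    state = {node: "S" for node in adj}
    for node in seeds:
        state[node] = "I"

    history = []
    for _ in range(steps):
        counts = {"S": 0, "I": 0, "R": 0}
        for value in state.values():
            counts[value] += 1
        history.append(counts)

        # Synchronous update: decisions are based on the state at the start of the step.
        new_state = dict(state)
        for node, value in state.items():
            if value != "I":
                continue
            for neighbour in adj[node]:
                if state[neighbour] == "S" and rng.random() < beta:
                    new_state[neighbour] = "I"
            if rng.random() < gamma:
                new_state[node] = "R"
        state = new_state
    return history


# Toy example: a path of three users, with certain transmission and recovery.
path = {0: [1], 1: [0, 2], 2: [1]}
history = simulate_sir(path, seeds=[0], beta=1.0, gamma=1.0, steps=4)
```

With `beta` and `gamma` both set to 1 the information moves one hop per step along the path; with smaller values the outcome becomes stochastic and depends on the random seed.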

Diffusion models can be devised to understand how knowledge ‘spreads’ through a network of people in a community, in a society, in a city, in a country or even at planetary scale. Knowledge is usually defined as information acquired through experience or education. Therefore, tracing the diffusion of knowledge corresponds in a sense to tracing the diffusion of information (Hidalgo, 2015). We have witnessed the development of a variety of models for describing the spreading and the adoption of knowledge, such as the generation and acceptance of innovative ideas, starting from the marketing community (Goldenberg et al., 2001). Network scientists have also contributed to the field through the definition of several models of increasing complexity (Iacopini et al., 2018). One of the most famous models is the Bass model (Bass, 1969), which describes the adoption of an innovation in a community by means of a very parsimonious mathematical model.
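As an illustration of how parsimonious the Bass model is, its standard formulation has only two parameters: a coefficient of innovation p (adoption driven by external influence) and a coefficient of imitation q (adoption driven by previous adopters). The cumulative fraction of adopters F(t) satisfies dF/dt = (p + q F)(1 - F), which has a closed-form solution. The sketch below evaluates that solution; the function name and the parameter values are illustrative assumptions, not values taken from the article.

```python
import math


def bass_cumulative_adoption(t, p, q):
    """Fraction of eventual adopters who have adopted by time t.

    Closed-form solution of dF/dt = (p + q*F) * (1 - F) with F(0) = 0:
        F(t) = (1 - exp(-(p + q) t)) / (1 + (q / p) exp(-(p + q) t))
    p: coefficient of innovation (must be > 0)
    q: coefficient of imitation
    """
    decay = math.exp(-(p + q) * t)
    return (1.0 - decay) / (1.0 + (q / p) * decay)


# Illustrative parameter values only.
p, q = 0.03, 0.38
curve = [bass_cumulative_adoption(t, p, q) for t in range(0, 31)]
```

The resulting curve is S-shaped: adoption starts slowly, accelerates as imitation takes over, and saturates as the pool of potential adopters is exhausted.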

An interesting class of emerging models is that describing dynamical (Barrat et al., 2008), temporal (Tang, Scellato, Musolesi, Mascolo & Latora, 2010; Holme & Saramäki, 2012) and spatio-temporal networks (Williams & Musolesi, 2016). These models can be used for example to identify influencers and mediators in time-varying social networks (Tang, Musolesi, Mascolo, Latora & Nicosia, 2010). Spatio-temporal network models can also be used to represent the spreading of information in time and space through networks of people (or places) and study its dynamics.
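A key difference between temporal and static networks is that information can only travel along time-respecting paths, i.e., sequences of contacts that occur in chronological order. The following sketch, which is a simplified illustration rather than the method of any of the cited papers, computes the earliest time at which each node can be reached from a source. The contact format `(time, u, v)` and the function name are assumptions made for this example.

```python
from itertools import groupby


def earliest_arrival(contacts, source, start_time=0):
    """Earliest time each node can receive information starting from `source`.

    contacts: iterable of (time, u, v) tuples; contacts are undirected and
              information received at time t can be forwarded only at a
              strictly later time.
    Returns a dict mapping each reachable node to its earliest arrival time.
    """
    arrival = {source: start_time}
    ordered = sorted(contacts, key=lambda contact: contact[0])
    for t, group in groupby(ordered, key=lambda contact: contact[0]):
        if t < start_time:
            continue
        # Nodes informed strictly before t (or the source itself) can transmit at t.
        can_send = {n for n, a in arrival.items() if a < t or n == source}
        for _, u, v in group:
            for sender, receiver in ((u, v), (v, u)):
                if sender in can_send and receiver not in arrival:
                    arrival[receiver] = t
    return arrival


# "c" meets "d" at time 1, before "c" has been informed at time 2.
contacts = [(1, "a", "b"), (2, "b", "c"), (1, "c", "d")]
arrival = earliest_arrival(contacts, source="a")
```

In the aggregated static graph the node "d" is connected to "a" through "b" and "c", but no time-respecting path exists, so "d" does not appear in the result.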

2 Social media

Social media can be used to track the spreading of information as it happens in real-time at a very large scale. They are extremely popular, with billions of users. The first examples of information diffusion in the digital world occurred through weblogs, which predate the advent of social media platforms such as Facebook (Ellison et al., 2007) and Twitter (Kwak et al., 2010; Cha et al., 2010). Some of the first studies of information diffusion in these media can be found in (Gruhl et al., 2004; Adar & Adamic, 2005). The geography of Twitter has been analysed in several works, such as (Leetaru et al., 2013). Locative media have been investigated by several researchers, focussing on the interplay between our online and offline lives (Özkul, 2013; Özkul & Humphreys, 2015; Özkul, 2017). More generally, social media can contribute substantially to opinion formation, as studied for example in (Watts & Dodds, 2007).

In (De Domenico et al., 2013) the authors examine the spreading of rumours related to the discovery of the Higgs boson (The ATLAS Collaboration, 2012; The CMS Collaboration, 2012) in real-time from Twitter data. Twitter is a social network where a user can write microblog posts of up to 140 characters. A post can then be retweeted (i.e., reposted) by other users who are following a person or an organisation (followers). The availability of data from the platform also allows for the extraction of mathematical models of diffusion in the social network of followers across the globe. Furthermore, the study of the topics discussed on Twitter has attracted the attention of a vast number of researchers in the past years. For example, in (Romero et al., 2011) the authors analyse the adoption of hashtags and show that politically controversial ones are particularly persistent in the network over time.

Another interesting example of tracing knowledge using digital data is the analysis of software developers’ activities on GitHub, an online collaborative software development platform. In (Lima et al., 2014) the authors examine the network of collaborations on different software projects at a global scale, looking at how collaborations unfold, including their geographic characteristics. On GitHub users can create code repositories; every repository has a list of collaborators, who can make changes to the content of the repository. A user can submit changes to the codebase or, if they do not want to be a collaborator, they can ‘fork’ the project. Forking creates a duplicate of the repository for independent work. By analysing the interactions on GitHub, it is possible to observe patterns of collaboration over time and over space. Indeed, software production is a creative process: collaboration on software projects can be seen as a new way of sharing knowledge and participating in the co-creation of intellectual artefacts.
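One simple way to turn repository membership into a collaboration network is to link two users whenever they collaborate on the same repository, weighting each link by the number of shared repositories. The sketch below illustrates this projection under stated assumptions: the input dictionary, the user names and the function name are hypothetical, and this is not necessarily the construction used in (Lima et al., 2014).

```python
from collections import Counter
from itertools import combinations


def collaboration_edges(repo_collaborators):
    """Project a repository -> collaborators mapping onto a user-user network.

    repo_collaborators: dict mapping a repository name to a list of user names
    Returns a Counter mapping each (user_a, user_b) pair, in alphabetical
    order, to the number of repositories the two users share.
    """
    edges = Counter()
    for collaborators in repo_collaborators.values():
        # Each unordered pair of distinct collaborators shares this repository.
        for pair in combinations(sorted(set(collaborators)), 2):
            edges[pair] += 1
    return edges


# Hypothetical toy data.
repos = {
    "parser": ["ann", "bob", "cy"],
    "viewer": ["bob", "ann"],
}
edges = collaboration_edges(repos)
```

If users' locations are known, each weighted edge can then be annotated with a geographic distance to study the spatial characteristics of collaboration.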

One of the limitations of this body of work is that external factors often cannot be captured, for example the influence of other social media or of press outlets such as television, newspapers and other online and offline media. An interesting study on evaluating the external influence of other sources on information spreading can be found in (Myers et al., 2012).

The diffusion of social media has also led scholars to ask questions about the importance of distance not only in a world connected by global online networks (Mok et al., 2010), but also at city level (Hristova, Noulas, Brown, Musolesi & Mascolo, 2016). In (Scellato et al., 2010), by analysing four social networks (Brightkite, Foursquare, LiveJournal and Twitter), the authors show that distance still matters in the establishment of social links: in the case of social media, we cannot talk about the death of distance (Cairncross, 2001), at least not yet. The resulting geo-social networks show heavy-tailed degree distributions, users with high node locality and social triangles on a local geographic scale. Hristova et al. showed that geo-social networks in different platforms are also highly correlated (Hristova, Noulas, Brown, Musolesi & Mascolo, 2016) and that this fact can be used for effective link prediction.
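A basic ingredient of this kind of analysis is the geographic length of each social link. As a minimal sketch, assuming that each user has a single known location expressed as latitude and longitude in degrees, link lengths can be computed with the haversine great-circle formula. The helper names, the mean Earth radius and the toy data are illustrative assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def link_lengths(edges, positions):
    """Length in km of each social link, given a user -> (lat, lon) mapping."""
    return {
        (u, v): haversine_km(*positions[u], *positions[v])
        for u, v in edges
    }


# Hypothetical users placed in London and Paris (approximate coordinates).
positions = {"u1": (51.5074, -0.1278), "u2": (48.8566, 2.3522), "u3": (51.5074, -0.1278)}
lengths = link_lengths([("u1", "u2"), ("u1", "u3")], positions)
```

The distribution of these lengths can then be compared with what would be expected if links were formed independently of distance.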

Other interesting problems include the identification of influential spreaders of information (Kitsak et al., 2010), the maximisation of the spreading process itself (Kempe et al., 2003), and the impact of homophily on the diffusion of information (Aral et al., 2009). Thanks to the availability of geo-social networks, it is also possible to identify the key spreaders in space (Lima & Musolesi, 2012): this allows us to understand how specific people in a spatial network contribute to the dissemination of information and ideas in certain geographic areas, such as neighbourhoods, cities or states. It is worth noting that the study of information diffusion in social media has not been limited to text but has also focussed on other multimedia content such as photos on Flickr (Cha et al., 2009). It has been argued that digital systems are defining the space itself and are changing the way we live our everyday lives (Kitchin & Dodge, 2011). At the same time this rich set of information is collected and can be analysed, also in real-time, reconstructing the dense fabric of interactions of millions of individuals.
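One widely used structural indicator for ranking potential spreaders is the k-shell (or core) index of a node, obtained by repeatedly peeling away the least connected nodes; Kitsak et al. (2010) relate this index to spreading influence. The sketch below is a simple, unoptimised peeling procedure written for illustration, with a hypothetical adjacency-list input; it should not be read as the exact procedure of the cited work.

```python
def core_numbers(adj):
    """Core number of each node of an undirected graph.

    adj: dict mapping each node to a list of its neighbours (no self-loops)
    A node has core number k if it belongs to the k-core (the maximal subgraph
    in which every node has degree >= k) but not to the (k+1)-core.
    """
    degree = {node: len(neighbours) for node, neighbours in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel the node with the smallest degree among those still present.
        node = min(remaining, key=lambda n: degree[n])
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for neighbour in adj[node]:
            if neighbour in remaining:
                degree[neighbour] -= 1
    return core


# A triangle (a, b, c) with a pendant node d attached to a.
graph = {"a": ["b", "c", "d"], "b": ["a", "c"], "c": ["a", "b"], "d": ["a"]}
cores = core_numbers(graph)
```

In this toy graph node "a" has the highest degree, but it shares the same core number as "b" and "c": the core index captures how deeply a node is embedded in the network rather than how many neighbours it has.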

Information can also be political in nature, as in the case of protests (González-Bailón et al., 2011, 2013; Margetts et al., 2015). For example, by retrieving information through social media, it is possible to map the evolution of protests, including the exchange of information among groups operating at local and global level. This offers an unprecedented opportunity for studying and understanding political phenomena in real-time. Another very interesting aspect is that, for the first time, researchers are able to follow the spreading of information involving single actors and their thoughts at a very fine granularity in both space and time. Moreover, it is worth noting that this information is not mediated through recollection, but is instantaneous in nature, i.e., it is generated and can be collected as events unfold.

It is worth noting that social media data have been used in a very large number of studies, for understanding not only the spreading of information itself, but also the contextual factors that influence it, such as demographics and culture. For example, one of the topics of interest of social media studies has been the analysis of surnames, which can be used to understand the geo-demographics of different areas of large cities, such as London (Longley & Adnan, 2016). In this way, it is possible to trace back global networks of people and migrations reflected in the ethnic composition of large cities. Geo-social networks can also be used effectively to trace and study diversity and gentrification phenomena (Hristova, Williams, Musolesi, Panzarasa & Mascolo, 2016). Finally, in (Silva et al., 2014) the authors discuss how data from the Foursquare platform can be used to understand cultural characteristics of geographic areas by looking at the presence of specific food and drink venues.

3 Large-scale data repositories

Wikipedia, an online encyclopaedia that is completely edited by volunteers, is probably the most successful example of crowdsourced work. Through the analysis of Wikipedia usage in different languages and the application of data mining techniques to the content of the various articles, it is possible to reconstruct which topics are of interest and relevant in different parts of the globe (Brandes et al., 2009; Yasseri et al., 2012). Moreover, the articles themselves can be geo-located, i.e., mapped to different areas of the planet. In other words, it is truly possible to create maps of knowledge that connect people, objects, facts and places around the world. In order to extract information from the text, natural language processing (NLP) techniques (Manning & Schütze, 1999) are usually applied. With respect to the extraction of geographic information, geo-localisation techniques are used to map names in the text to geographic named entities. The mapping is performed by matching names against the locations listed in geographic gazetteers such as Geonames (Ahlers, 2013).
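The gazetteer lookup step can be illustrated with a deliberately naive sketch: every word in a text is compared against a small table of place names with coordinates. The two-entry table below is a hypothetical stand-in for a real gazetteer, its coordinates are approximate, and the function name is an assumption. Real systems must also handle multi-word names and ambiguous names (several places called 'Paris', for instance), which this example ignores.

```python
import re

# Hypothetical miniature gazetteer: lower-cased name -> (latitude, longitude).
TOY_GAZETTEER = {
    "london": (51.5074, -0.1278),
    "paris": (48.8566, 2.3522),
}


def geolocate_mentions(text, gazetteer=TOY_GAZETTEER):
    """Return (mention, (lat, lon)) pairs for words found in the gazetteer."""
    mentions = []
    for token in re.findall(r"[A-Za-z]+", text):
        coordinates = gazetteer.get(token.lower())
        if coordinates is not None:
            mentions.append((token, coordinates))
    return mentions


mentions = geolocate_mentions("The story moves from London to Paris by train.")
```

The resulting list of located mentions is the raw material for mapping an article, or a literary work, onto geographic space.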

As a more general trend, we observe that citizens are becoming more and more involved in the creation of knowledge, not only by contributing text and multimedia content (images, videos, etc.) but also by acting as sensors and collecting information about the environments where they live (Goodchild, 2007). One of the most interesting examples is OpenStreetMap (OpenStreetMap, 2019). Through the collaboration of a large number of individuals, knowledge is created, shared and disseminated. Several researchers have investigated how this collaboration takes place, characterising its dynamics (Hristova et al., 2013).

Other repositories that allow for the analysis of networks of knowledge include online literary repositories, such as Gutenberg (The Gutenberg project, 2019). The analysis of the networks emerging in literature has been a theme of the work of Moretti, among others (Moretti, 2005). Indeed, one possibility is to extract names of geographic locations from text, as discussed above for Wikipedia, in order to reconstruct how literary works are mapped in geographic space and to understand how the fictional worlds are related to the real ones.

Another interesting example is the tracing of influence through citation networks (Bilke & Peterson, 2001). These can be traced over both time and space and they provide a very detailed way of mapping out collaborations among individuals in different countries, considering the exchange of information across disciplines (Porter & Rafols, 2009; Sinatra et al., 2016) and in specific communities (Lehmann et al., 2003). An entirely new academic discipline, usually referred to as “science of science” (Fortunato et al., 2018), which focusses on understanding the dynamics of scientific collaboration and the exchange of ideas, is emerging.

4 Mobility data

Another interesting source of data is the increasing availability of mobility trace datasets, from flights (Guimera et al., 2005) to migration patterns (Davis et al., 2013), and from social media (Noulas et al., 2012; Hawelka et al., 2014) to mobile data collected directly by means of users’ smartphones (Campbell et al., 2008; Zhao et al., 2015; Canzian & Musolesi, 2015) or by cellular phone operators through the network infrastructure (Laurila et al., 2012).

In this case, the network of knowledge emerging from the data is in a sense indirect, but through this information we can map how people move at different geographic scales and spread ideas. Indeed, as discussed above, some people have talked about the death of distance (Cairncross, 2001) as a consequence of the emergence of digital technologies, including for example the use of smartphones and global network connectivity. However, this claim is disputed by other researchers, who stress that distance is still a major factor in human relationships and, therefore, in the spreading of ideas (Scellato et al., 2010).

In general, one of the most difficult aspects of using mobility data for understanding spreading processes is the fact that data provide only a partial view of the interactions between people. Indeed, it is extremely difficult to collect ground-truth information for reconstructing interactions from these datasets. For these reasons, it is always very important to be extremely cautious in deriving universal models from these datasets in terms of human behaviour and interactions.

5 Tools

In this section, we will briefly review the tools that are used for tracing networks from digital data. We can broadly divide them into three main categories: data collection, data analysis and data visualisation.

First of all, data collection is a fundamental phase in the process of tracing networks of knowledge. Data is usually collected from digital sources such as the Web through scraping tools (Mitchell, 2015) or from digital platforms by means of Application Programming Interfaces (APIs) such as those provided by Twitter (Twitter REST API, 2019). A limited amount of data is usually available for free, such as in the case of Twitter.

Secondly, data analysis is performed with a variety of tools from stand-alone applications such as Gephi (Gephi - The Open Graph Viz Platform, 2019) to purpose-built software developed using existing libraries such as NetworkX (for Python) (Networkx Library, 2019) or igraph (available for a variety of programming languages) (igraph, 2019). There are also several libraries for spatial analysis, such as PySAL (PySAL, 2019) for Python.

Finally, as far as data visualisation is concerned, various tools can be used for representing the networks of knowledge extracted from the data at local, national and global scales. An emerging class of tools is represented by open source tools for visualisation. Examples include again Gephi (Gephi - The Open Graph Viz Platform, 2019) or Cytoscape (Cytoscape: An Open Source Platform for Complex Network Analysis, 2019). Network data visualisation is also performed through tools developed by means of ‘general-purpose’ programming languages and platforms for data manipulation and visualisation such as Processing (Reas & Fry, 2014). Other data visualisation tools include classic Geographic Information Systems (Longley et al., 2015), which provide functionalities for mapping networks in the geographic space and more advanced digital cartography visualisations (Cheshire & Uberti, 2014).

6 Outlook

In this paper, we have examined new and fascinating ways of tracing networks of knowledge in real-time using the digital traces that we leave in our everyday lives. We have explored the possibilities offered by the availability of a variety of new forms of data, including social media, large-scale data repositories and mobile data. These data sources allow for new types of investigation that were not possible just a few years ago. At the same time, it is important to consider the limitations that are inherent in using these new forms of data for mapping social phenomena. One of the key issues is the intrinsic bias associated with them. For example, the demographics of social media users might not be representative of the general population (Mislove et al., 2011). However, it is worth noting that these sources are still valuable for understanding the behaviour of particular subsets of the population. In general, it is still possible to derive population-level observations, provided that the limitations of the data sources used for the analysis are taken into consideration. Predictive models using techniques from Machine Learning and Artificial Intelligence are also opening up new possibilities in this field. For example, a very interesting and promising area is the creation of knowledge graphs, linking concepts, people and places (Nickel et al., 2015). The application of machine learning to the understanding of processes of knowledge dissemination and its implications has also recently attracted the attention of social scientists (see for example (Mackenzie, 2017), in which the author develops an archaeology of machine learning operations following Foucault (Foucault, 2013)).

Other important considerations are related to the potential privacy issues associated with these new emerging sources of data (Musolesi, 2014; Rossi & Musolesi, 2015). It is worth noting that most of the data cited in this article are the results of aggregation, i.e., they protect the identity of individual users.

Most of the existing work has been based on extracting, mapping and visualising networks of knowledge. We believe that modelling and predicting the dissemination of knowledge represent very interesting and still open areas for researchers, also considering the recent results in both network science (Boccaletti et al., 2006; Barrat et al., 2008; Newman, 2010; Pastor-Satorras et al., 2015) and machine learning (Murphy, 2012; Goodfellow et al., 2016). This is indeed an area where truly interdisciplinary work is essential, given that the research questions in this field lie at the interface of the humanities, social sciences, and computational and mathematical sciences.

Biographical Note

Mirco Musolesi is Professor in Data Science at the Department of Geography at University College London and Turing Fellow at the Alan Turing Institute. At UCL he leads the Intelligent Social Systems Lab. He received a PhD in Computer Science from University College London and a Master in Electronic Engineering from the University of Bologna. After postdoctoral work at Dartmouth College and Cambridge, he held academic posts at St Andrews and Birmingham. The research focus of his lab is on sensing, modelling, understanding and predicting human behaviour in space and time, at different scales, using the ‘digital traces’ we generate daily in our online and offline lives. He is interested in developing mathematical and computational models as well as implementing real-world systems based on them. This work has applications in a variety of domains, such as ubiquitous and autonomous systems design, healthcare, and security and privacy.

Bibliography

Adar, E. & Adamic, L. A. (2005), Tracking information epidemics in blogspace, in ‘Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence’, IEEE Computer Society, pp. 207–214.
Ahlers, D. (2013), Assessment of the accuracy of geonames gazetteer data, in ‘Proceedings of the 7th Workshop on Geographic Information Retrieval’, ACM, pp. 74–81.
Anderson, R. M. & May, R. M. (1991), Infectious diseases of humans: dynamics and control, Vol. 28, Oxford University Press, Oxford and New York.
Aral, S., Muchnik, L. & Sundararajan, A. (2009), ‘Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks’, Proceedings of the National Academy of Sciences 106 (51), 21544–21549.
Barrat, A., Barthelemy, M. & Vespignani, A. (2008), Dynamical Processes on Complex Networks, Cambridge University Press.
Bass, F. M. (1969), ‘A new product growth for model consumer durables’, Management Science 15 (5), 215–227.
Bilke, S. & Peterson, C. (2001), ‘Topological properties of citation and metabolic networks’, Physical Review E 64 (3), 036106.