Mind the Gap: A Study in Global Development through Persistent Homology

Andrew Banman; Lori Ziegelmeier

arXiv:1702.08593·math.AT·January 12, 2018


TL;DR

This paper applies persistent homology, a topological data analysis technique, to study global development patterns using economic and health indicators, revealing hidden structures and relationships among countries.

Contribution

It introduces a novel application of persistent homology to analyze global development data, uncovering multi-scale patterns and geographic cycles.

Findings

1. Identification of localized development clusters
2. Discovery of cycles related to geographic borders
3. Revelation of hidden similarities among countries

Abstract

The Gapminder project set out to use statistics to dispel simplistic notions about global development. In the same spirit, we use persistent homology, a technique from computational algebraic topology, to explore the relationship between country development and geography. For each country, four indicators, gross domestic product per capita; average life expectancy; infant mortality; and gross national income per capita, were used to quantify the development. Two analyses were performed. The first considers clusters of the countries based on these indicators, and the second uncovers cycles in the data when combined with geographic border structure. Our analysis is a multi-scale approach that reveals similarities and connections among countries at a variety of levels. We discover localized development patterns that are invisible in standard statistical methods.

Tables

Table 1: Statistics of each indicator: GDP per capita (GDP), Life Expectancy (LE), Infant Mortality rate (IM), and GNI per capita (GNI). The first five statistics correspond to the raw data; the last corresponds to the attenuated and scaled data. Naturally, high GDP, GNI, and life expectancy are favorable, whereas high infant mortality rate is unfavorable.

Indicator	Max	Min	Median	Mean	Std. Dev.	Scaled Mean
GDP	148374	599	11903	18972	21523	-0.476
LE	84.8	48.86	74.5	72.56	7.74	0.296
IM	96	1.5	23.89	15	21.9	0.528
GNI	87030	350	8360	13596	15399	-0.431

Table 2: Countries comprising the largest connected components in the VR complex at filtration $\epsilon=0.08$ over $\mathbb{R}^{2}$ and the corresponding means of scaled indicators, GDP/capita (GDP) and life expectancy (LE), for each cluster. Clusters are listed in ascending GDP order, for clarity in comparison.

Countries (ISO2)	GDP	LE
Bangladesh, Kyrgyzstan, Cambodia, Mauritania, Micronesia Fed. Sts., Nepal, Syria, Gambia, Comoros, Myanmar, Sudan, Sao Tome and Principe, India, Laos, Marshall Islands, Guyana, Pakistan, Ghana, Nigeria, Yemen Rep., Djibouti, Kenya, Senegal, Tanzania, Vanuatu, Haiti, Liberia, Madagascar, Solomon Islands, Ethiopia, Rwanda, Benin, Kiribati, Burkina Faso, Burundi, Congo Dem. Rep., Niger, Papua New Guinea, Togo, Uganda, Zimbabwe, Eritrea, Mali, Malawi, Guinea, Cote d’Ivoire, Cameroon, Sierra Leone, Mozambique, Chad, Zambia, South Sudan, Guinea-Bissau, Fiji	-0.93	-0.15
Albania, Bosnia and Herzegovina, Colombia, Jordan, Sri Lanka, Tunisia, Peru, Macedonia FYR, Barbados, China, Dominican Rep., Algeria, Ecuador, Montenegro, Serbia, Thailand, Bulgaria, Brazil, Iran, Venezuela, Mauritius, Mexico, Romania, Argentina, Saint Lucia, Armenia, Jamaica, Paraguay, El Salvador, Morocco, Vietnam, Bolivia, Bhutan, Cape Verde, Georgia, Guatemala, Honduras, Moldova, Samoa, Belize, Ukraine, Indonesia, Philippines, Saint Vincent and the Grenadines, Egypt, Grenada, Tonga, Uzbekistan, Tajikistan, Korea Dem. Rep., Timor-Leste, Palestine	-0.69	0.44
Antigua and Barbuda, Croatia, Uruguay, Cuba, Panama, Turkey, Lebanon	-0.37	0.63
Estonia, Poland, Slovak Republic, Hungary, Latvia, Malaysia, Lithuania, Seychelles	-0.19	0.53
Cyprus, Malta, Slovenia, Israel, Spain, Italy, Korea Rep., New Zealand, Portugal, Greece	-0.02	0.83
Austria, Australia, Canada, Germany, Denmark, Netherlands, Sweden, Belgium, Taiwan, Finland, France, United Kingdom, Bahrain, Ireland	0.38	0.80

Table 3: Generating countries of the South America cycle in the $\mathbb{R}^{2}$-weighted graph from the dimension-1 barcode interval $[0.34,0.62)$ in Fig. 3.

Country	GDP	LE
Chile	-0.29	0.71
Peru	-0.63	0.72
Bolivia	-0.81	0.37
Brazil	-0.52	0.43
Argentina	-0.45	0.55

Table 5: Cycle from Libya to Chad found in the country border graph with weight $d_{I}$, where $I=\{$GDP/capita (GDP), life expectancy (LE), infant mortality (IM), GNI/capita (GNI)$\}$. Parsed from the interval persisting over $[1.10,1.96)$ in Fig. 2.

Country	GDP	LE	IM	GNI
Libya	-0.46	0.36	0.79	-0.28
Sudan	-0.89	0.05	0.02	-0.93
Chad	-0.95	-0.49	-0.77	-0.96
Niger	-0.99	-0.31	-0.18	-0.98

Table 7: Countries composing generating cycles and the corresponding birth and death values representing the dimension-1 homology classes of the VR complex stream built over the country border graph with weights $d_{I}$, where $I=\{$GDP/capita (GDP), life expectancy (LE), infant mortality (IM), GNI/capita (GNI)$\}$. Cycles are listed in ascending order of interval birth.

Birth	Death	Generating Countries
0.31	0.52	Hungary, Romania, Croatia, Montenegro, Serbia
0.46	0.94	Chile, Peru, Brazil, Argentina
0.53	0.96	Romania, Ukraine, Belarus, Poland, Hungary, Slovak Republic
0.54	0.94	Austria, Italy, Switzerland, Germany, France
0.56	0.75	Mali, Mauritania, Senegal, Guinea
0.71	0.85	Congo Dem. Rep., Zambia, Tanzania, Burundi
0.71	0.81	Kazakhstan, Turkmenistan, China, Kyrgyzstan, Uzbekistan
0.75	0.85	China, Nepal, Bhutan, India
0.78	0.85	Congo Dem. Rep., Uganda, Burundi, Tanzania
0.84	1.18	Czech Rep., Germany, Austria, Slovenia, Hungary, Slovak Republic
0.90	1.38	Congo Dem. Rep., Congo Rep., Central African Rep., Cameroon
0.91	0.96	Syria, Turkey, Iraq, Iran
1.06	1.95	Algeria, Mauritania, Sudan, Chad, Egypt, Niger, Mali, Libya
1.18	1.52	Israel, Jordan, Lebanon, Syria
1.22	1.85	Afghanistan, Turkmenistan, China, India, Tajikistan, Pakistan, Uzbekistan
1.24	1.51	Algeria, Niger, Mauritania, Mali
1.26	1.28	Afghanistan, Tajikistan, Turkmenistan, Uzbekistan
1.30	1.77	Iran, Pakistan, Afghanistan, Turkmenistan
1.34	1.49	Egypt, Israel, Jordan, Palestine

Table 8: Country and the corresponding year of the most recently available data for each indicator: GDP per capita (GDP), Life Expectancy (LE), Infant Mortality (IM), and GNI per capita (GNI).

Country	GDP	LE	IM	GNI
Afghanistan	2015	2016	2015	2010
Albania	2015	2016	2015	2011
Algeria	2015	2016	2015	2011
Angola	2015	2016	2015	2011
Antigua and Barbuda	2015	2016	2015	2011
Argentina	2015	2016	2015	2011
Armenia	2015	2016	2015	2011
Australia	2015	2016	2015	2010
Austria	2015	2016	2015	2011
Azerbaijan	2015	2016	2015	2011
Bahamas	2015	2016	2015	2010
Bahrain	2015	2016	2015	2010
Bangladesh	2015	2016	2015	2011
Barbados	2015	2016	2015	2009
Belarus	2015	2016	2015	2011
Belgium	2015	2016	2015	2011
Belize	2015	2016	2015	2011
Benin	2015	2016	2015	2011
Bhutan	2015	2016	2015	2011
Bolivia	2015	2016	2015	2011
Bosnia and Herzegovina	2015	2016	2015	2011
Botswana	2015	2016	2015	2011
Brazil	2015	2016	2015	2011
Brunei	2015	2016	2015	2009
Bulgaria	2015	2016	2015	2011
Burkina Faso	2015	2016	2015	2011
Burundi	2015	2016	2015	2011
Cambodia	2015	2016	2015	2011
Cameroon	2015	2016	2015	2011
Canada	2015	2016	2015	2011
Cape Verde	2015	2016	2015	2011
Central African Rep.	2015	2016	2015	2011
Chad	2015	2016	2015	2011
Chile	2015	2016	2015	2011
China	2015	2016	2015	2011
Colombia	2015	2016	2015	2011
Comoros	2015	2016	2015	2011
Congo Dem. Rep.	2015	2016	2015	2011
Congo Rep.	2015	2016	2015	2011
Costa Rica	2015	2016	2015	2011
Cote d’Ivoire	2015	2016	2015	2011
Croatia	2015	2016	2015	2011
Cyprus	2015	2016	2015	2010
Czech Rep.	2015	2016	2015	2011
Denmark	2015	2016	2015	2011
Djibouti	2015	2016	2015	2009
Dominica	2015	2016	2015	2011
Dominican Rep.	2015	2016	2015	2011
Ecuador	2015	2016	2015	2011
Egypt	2015	2016	2015	2011
El Salvador	2015	2016	2015	2011
Equatorial Guinea	2015	2016	2015	2011
Eritrea	2015	2016	2015	2011
Estonia	2015	2016	2015	2011
Ethiopia	2015	2016	2015	2011
Fiji	2015	2016	2015	2011
Finland	2015	2016	2015	2011
France	2015	2016	2015	2011
Gabon	2015	2016	2015	2011
Gambia	2015	2016	2015	2011
Georgia	2015	2016	2015	2011
Germany	2015	2016	2015	2011
Ghana	2015	2016	2015	2011
Greece	2015	2016	2015	2011
Grenada	2015	2016	2015	2011
Guatemala	2015	2016	2015	2011
Guinea	2015	2016	2015	2011
Guinea-Bissau	2015	2016	2015	2011
Guyana	2015	2016	2015	2010
Haiti	2015	2016	2015	2011
Honduras	2015	2016	2015	2011
Hungary	2015	2016	2015	2011
Iceland	2015	2016	2015	2011
India	2015	2016	2015	2011
Indonesia	2015	2016	2015	2011
Iran	2015	2016	2015	2009
Iraq	2015	2016	2015	2011
Ireland	2015	2016	2015	2011
Israel	2015	2016	2015	2011
Italy	2015	2016	2015	2011
Jamaica	2015	2016	2015	2011
Japan	2015	2016	2015	2011
Jordan	2015	2016	2015	2011
Kazakhstan	2015	2016	2015	2011
Kenya	2015	2016	2015	2011
Kiribati	2015	2016	2015	2011
Korea Rep.	2015	2016	2015	2011
Kuwait	2015	2016	2015	2010
Kyrgyzstan	2015	2016	2015	2011
Laos	2015	2016	2015	2011
Latvia	2015	2016	2015	2011
Lebanon	2015	2016	2015	2011
Lesotho	2015	2016	2015	2011
Liberia	2015	2016	2015	2011
Libya	2015	2016	2015	2009
Lithuania	2015	2016	2015	2011
Luxembourg	2015	2016	2015	2011
Macedonia FYR	2015	2016	2015	2011
Madagascar	2015	2016	2015	2011
Malawi	2015	2016	2015	2011
Malaysia	2015	2016	2015	2011
Maldives	2015	2016	2015	2011
Mali	2015	2016	2015	2011
Malta	2015	2016	2015	2010
Mauritania	2015	2016	2015	2011
Mauritius	2015	2016	2015	2011
Mexico	2015	2016	2015	2011
Micronesia Fed. Sts.	2015	2016	2015	2011
Moldova	2015	2016	2015	2011
Mongolia	2015	2016	2015	2011
Montenegro	2015	2016	2015	2011
Morocco	2015	2016	2015	2011
Mozambique	2015	2016	2015	2011
Namibia	2015	2016	2015	2011
Nepal	2015	2016	2015	2011
Netherlands	2015	2016	2015	2011
New Zealand	2015	2016	2015	2010
Nicaragua	2015	2016	2015	2011
Niger	2015	2016	2015	2011
Nigeria	2015	2016	2015	2011
Norway	2015	2016	2015	2011
Oman	2015	2016	2015	2010
Pakistan	2015	2016	2015	2011
Palestine	2015	2016	2015	2005
Panama	2015	2016	2015	2011
Papua New Guinea	2015	2016	2015	2011
Paraguay	2015	2016	2015	2011
Peru	2015	2016	2015	2011
Philippines	2015	2016	2015	2011
Poland	2015	2016	2015	2011
Portugal	2015	2016	2015	2011
Qatar	2015	2016	2015	2011
Romania	2015	2016	2015	2011
Russia	2015	2016	2015	2011
Rwanda	2015	2016	2015	2011
Saint Lucia	2015	2016	2015	2011
Saint Vincent and the Grenadines	2015	2016	2015	2011
Samoa	2015	2016	2015	2011
Sao Tome and Principe	2015	2016	2015	2011
Saudi Arabia	2015	2016	2015	2011
Senegal	2015	2016	2015	2011
Serbia	2015	2016	2015	2011
Seychelles	2015	2016	2015	2011
Sierra Leone	2015	2016	2015	2011
Singapore	2015	2016	2015	2011
Slovak Republic	2015	2016	2015	2011
Slovenia	2015	2016	2015	2011
Solomon Islands	2015	2016	2015	2011
South Africa	2015	2016	2015	2011
Spain	2015	2016	2015	2011
Sri Lanka	2015	2016	2015	2011
Sudan	2015	2016	2015	2010
Suriname	2015	2016	2015	2010
Swaziland	2015	2016	2015	2011
Sweden	2015	2016	2015	2011
Switzerland	2015	2016	2015	2011
Syria	2015	2016	2015	2010
Tajikistan	2015	2016	2015	2011
Tanzania	2015	2016	2015	2011
Thailand	2015	2016	2015	2011
Timor-Leste	2015	2016	2015	2010
Togo	2015	2016	2015	2011
Tonga	2015	2016	2015	2011
Trinidad and Tobago	2015	2016	2015	2011
Tunisia	2015	2016	2015	2011
Turkey	2015	2016	2015	2011
Turkmenistan	2015	2016	2015	2011
Uganda	2015	2016	2015	2011
Ukraine	2015	2016	2015	2011
United Arab Emirates	2015	2016	2015	2011
United Kingdom	2015	2016	2015	2011
United States	2015	2016	2015	2011
Uruguay	2015	2016	2015	2011
Uzbekistan	2015	2016	2015	2011
Vanuatu	2015	2016	2015	2011
Venezuela	2015	2016	2015	2011
Vietnam	2015	2016	2015	2011
Yemen Rep.	2015	2016	2015	2011
Zambia	2015	2016	2015	2011

Equations

$$d_{I}:\mathbb{R}^{|I|}\times\mathbb{R}^{|I|}\to\mathbb{R}$$

$$d_{I}(x,y)=\sqrt{\sum_{i\in I}(x_{i}-y_{i})^{2}}$$

$$A_{i,j}=\begin{cases}1&\text{if countries }i,j\text{ share a border},\\0&\text{if countries }i,j\text{ do not share a border}\end{cases}$$

$$(D_{I})_{i,j}=\begin{cases}d_{I}(i,j)&\text{if }A_{i,j}=1,\\\infty&\text{if }A_{i,j}=0\end{cases}$$


Taxonomy

Topics: Topological and Geometric Data Analysis · Complex Network Analysis Techniques

Full text

Andrew Banman, University of Minnesota, 3 Morrill Hall, 100 Church St. S.E., Minneapolis, MN 55455, email: [email protected]
Lori Ziegelmeier, Macalester College, 1600 Grand Avenue, Saint Paul, MN 55105, email: [email protected]


1 Introduction

The Gapminder World project [GapminderWorld] provides a viewpoint of global development through a statistical lens. The first chart that loads in Gapminder plots each country’s gross domestic product (GDP) against the life expectancy of its citizens; see Fig. 1. The project equates GDP per capita with a nation’s wealth and life expectancy with its health. Countries are color-coded by their broad geographic region: the Americas, Eurasia, etc. A time-lapse animation shows countries transitioning along a common trajectory towards more health and wealth, telling a common story about global development. However, it is not clear what role geography plays in this trend. While one may say that most African nations lag behind most Eurasian nations, it is difficult to draw any finer conclusions solely from these two statistics, as each region spans a large range of the development statistics. Furthermore, Gapminder’s pre-determined regions have been chosen according to convention rather than derived from the data. For instance, it splits the African continent into Northern and Sub-Saharan regions, isolates India and a few of its neighbors, and joins Australia with Southern Asian countries. These regions do not necessarily align with regions of differing development.

We seek a quantifiable, fine-grained, and unbiased method to analyze development and geographic trends in this data. Persistent homology [barcodes; carlsson2009topology; edelsbrunner2008persistent] gives us tools to uncover the structure of high-dimensional, complicated data, revealing groups (connected components) and cycles (loops) in the data at multiple scales. Persistent homology has been used to understand the topological structure of data arising from applications including computer vision, biological aggregations, and brain structure, among many others [imagewebs; windowsandpersistence; hippocampalPH; corticalsurfacePD; visionTDA; Swarms]. In particular, the paper [DBLP:journals/corr/StolzHP16] analyzes data related to the recent so-called “Brexit” referendum using persistent homology.

We use persistent homology to expand on Gapminder’s study of health and wealth statistics. We explore two methods: (1) computing the connected components of the point cloud of indicators, first for GDP per capita and life expectancy and then with infant mortality and gross national income per capita added, and (2) adding the underlying geography to the indicators by constructing a weighted graph based on country borders to observe cycles in the data. The structure of the data is uncovered at multiple scales. Our analyses reveal connections among countries at a variety of levels and show subtleties in country similarities and differences, as well as loops formed by geographically linked countries. This provides a more nuanced view than the “first” versus “third” world paradigm, a construction that divides the world into discrete sets of developed and undeveloped countries [nationsonline].

The remainder of this paper proceeds as follows. Background on the computational approach of persistent homology is discussed in Section 2. Section 3 outlines the indicators we use to quantify health and wealth of nations, and our implementation of persistent homology on these indicators. We analyze the results of these computations in Section 4. Conclusions and future work are discussed in Section 5.

2 Background on Persistent Homology

Persistent homology is a computational approach to topology that encodes a parameterized family of homological features, such as connected components, loops, and trapped volumes, of a topological space. It allows one to answer basic questions about the structure of point clouds at multiple scales. As such, it can uncover the “shape” of data. Broadly, this procedure involves (1) interpreting a point cloud as a noisy sampling of a topological space, (2) creating a global object by forming connections between proximate points based on a scale parameter, (3) determining the topological structure made by these connections, and (4) looking for structures that persist across different scales. For foundational material and overviews of computational homology in the setting of persistence, see [edelsbrunner2008persistent; Edelsbrunner10; barcodes; carlsson2009topology; computingPH].

Beginning with a finite set of data points, a nested sequence of simplicial complexes indexed by a parameter $\epsilon$ may be created by taking the vertices as the data points and forming a $k$-simplex whenever $k+1$ points are pairwise within distance $\epsilon$. This construction is known as the Vietoris-Rips (VR) complex, which is often used for its computational tractability [barcodes]. Fixing a field $\mathbb{F}$, one builds a chain complex of vector spaces over $\mathbb{F}$ for each simplicial complex. For each pair $\epsilon_{1}<\epsilon_{2}$, there is a pair of simplicial complexes, $S_{\epsilon_{1}}$ and $S_{\epsilon_{2}}$, and an inclusion map $j:S_{\epsilon_{1}}\hookrightarrow S_{\epsilon_{2}}$. This inclusion map induces a chain map between the associated chain complexes, which further induces a linear map between the corresponding $k^{th}$ homology vector spaces. The dimension of the $k^{th}$ homology vector space is known as the $k^{th}$ Betti number $\beta_{k}$ and corresponds to the number of connected components, loops, trapped volumes, etc. of a simplicial complex for $k=0,1,2,\ldots$, respectively.

The $k^{th}$ barcode is a way of presenting Betti numbers across multiple scales $\epsilon$ [barcodes]. From the barcode, one can visualize the number of independent homology classes that persist across a given filtration interval $[\epsilon_{b},\epsilon_{d}]$ as a function of the scale $\epsilon$. See the top row of Fig. 3 for an example $\beta_{0}$ barcode and the bottom row of Fig. 3 for an example $\beta_{1}$ barcode. Each horizontal bar begins at the scale where a topological feature first appears (“is born”) and ends at the scale where the feature no longer remains (“dies”). The $k^{th}$ Betti number at any given parameter value $\epsilon$ is the number of bars that intersect the vertical line through $\epsilon$. For $\beta_{0}$ in our setting, there will be a distinct bar for each data point at small values of $\epsilon$, as the simplicial complex $S_{\epsilon}$ consists only of isolated points. At large values of $\epsilon$, only one bar remains, as all data will eventually connect into a single component.

The idea of persistence is not only to consider the homology for a single specified choice of parameter $\epsilon$ but rather to track topological features through a range of parameters. Those that persist over a large range of values are considered signals of underlying topology, while the short-lived features are taken to be noise inherent in approximating a topological space with a finite sample [carlsson2009topology].
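To make the Vietoris-Rips rule concrete, the following is a minimal Python sketch, not the authors' implementation (which uses the TDA library in R). The function name `vietoris_rips`, the toy point cloud, and the choice of "within distance" meaning "less than or equal to $\epsilon$" are assumptions made for illustration.

```python
from itertools import combinations
from math import dist

def vietoris_rips(points, eps, max_dim=2):
    """Return the simplices (as index tuples) of the VR complex at scale eps."""
    n = len(points)
    # close[i][j] is True when points i and j lie within distance eps.
    close = [[dist(points[i], points[j]) <= eps for j in range(n)] for i in range(n)]
    simplices = [(i,) for i in range(n)]  # every data point is a vertex
    for k in range(1, max_dim + 1):
        # A k-simplex needs k+1 points that are pairwise within eps.
        for combo in combinations(range(n), k + 1):
            if all(close[a][b] for a, b in combinations(combo, 2)):
                simplices.append(combo)
    return simplices

# Hypothetical point cloud: three nearby points and one distant point.
cloud = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
small = vietoris_rips(cloud, eps=1.0)  # two edges, no triangle yet
large = vietoris_rips(cloud, eps=1.5)  # third edge and the triangle appear
print(len(small), len(large))  # 6 8
```

Every simplex present at the smaller scale is also present at the larger one, which is the nesting that the inclusion maps in the text rely on. Enumerating all subsets is only practical for small examples.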
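Reading a Betti number off a barcode amounts to counting the bars that cross a vertical line. The short Python sketch below illustrates this with the first four dimension-1 intervals listed in Table 7. The helper name `betti_number` and the half-open convention (a bar is counted when birth is at most $\epsilon$ and $\epsilon$ is below death) are assumptions, and the count covers only these four bars, not the full barcode.

```python
def betti_number(intervals, eps):
    """Count the bars [birth, death) that contain the scale eps."""
    return sum(1 for birth, death in intervals if birth <= eps < death)

# First four (birth, death) pairs from Table 7.
bars = [(0.31, 0.52), (0.46, 0.94), (0.53, 0.96), (0.54, 0.94)]

print(betti_number(bars, 0.50))  # 2
print(betti_number(bars, 0.60))  # 3
```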

3 Methods

There are many ways to quantify the health and wealth of nations. We study four development indicators: gross domestic product (GDP) per capita¹, life expectancy², rate of infant mortality³, and gross national income (GNI) per capita⁴. These indicators were chosen because (1) we believe them to be broad indicators of health and wealth, and (2) recent data is available for a large set of countries in each indicator.

¹ Gross Domestic Product per capita by Purchasing Power Parities (in international dollars, fixed 2011 prices). Inflation and differences in the cost of living between countries have been taken into account [worldbankGNI].
² The average number of years a newborn child would live if current mortality patterns were to stay the same [worldbankGDP].
³ The probability that a child born in a specific year will die before reaching the age of one, if subject to current age-specific mortality rates. Expressed as a rate per 1,000 live births [gbd2013].
⁴ Gross national income converted to international dollars using purchasing power parity rates [UNICEF].

We consider this data in two sets: what we will call the four-dimensional ($\mathbb{R}^{4}$) data comprising all four indicators and the two-dimensional ($\mathbb{R}^{2}$) data comprising only GDP/capita and life expectancy. The raw $\mathbb{R}^{2}$ data (before the scaling discussed below) generates the Gapminder chart, see Fig. 1, allowing a comparison of our results to the chart.

The frequency of reporting and currency of statistics can vary dramatically by country, so any result necessarily carries the “according to available data” qualifier. We construct our data sets by taking the most recent value for each indicator corresponding to a country⁵. Countries with no available data for one or more indicators in this time frame are excluded from the data set. This yields data comprising 194 countries in the $\mathbb{R}^{2}$ set and 179 countries in the $\mathbb{R}^{4}$ set. See Table 1 for statistics such as the maximum, minimum, median, mean, and standard deviation for the raw data of the indicators.

⁵ Most data come from the years 2015 and 2016, with others as early as 2005. See Table 8 in Appendix A.

We consider the relative health and wealth of countries, and the presence of extreme outliers in GDP obscures this relationship. Rather than exclude these countries outright, we attenuate their values to two standard deviations from the mean. Alternatively, we could have taken the logarithm of GDP to bring the outliers closer to the bulk. However, this option has the undesirable consequence of exaggerating the distance between countries with very low GDP and understating the distance between higher GDP countries. For our purposes, it made more sense to collect the richest countries into one group at the extreme of the spectrum and likewise for the poorest. The same attenuation was done for the GNI per capita indicator.

Each indicator is then re-scaled to $[-1,1]$. The range $[-1,1]$ was chosen to give a normative representation of each indicator, in which -1 is least favorable and 1 is most favorable, e.g., the country with the lowest life expectancy has -1 in that dimension and the country with the lowest infant mortality has 1 in that dimension. Note that this does not imply zero is the average value for any indicator. There are many more relatively low GDP countries, even after attenuating outliers, see Table 1. This scaling is required to ensure each indicator carries equal weight in the persistent homology calculations. Otherwise GDP/capita and GNI/capita would completely obscure any features in life expectancy and infant mortality rate because they are orders of magnitude larger in conventional units.

For our calculations, we use the TDA library in R [TDApackage]. This library provides an API to create a filtered simplicial complex upon which to calculate the persistent homology. The final result of the computation is a list of persistence intervals $[\epsilon_{b},\epsilon_{d}]$, neatly displayed in a homology barcode, where each interval indicates a homological feature that is born at $\epsilon_{b}$ and dies at $\epsilon_{d}$. In this section, we outline our procedure for computing persistent homology of our data. In the next, we analyze the results.
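The attenuation and re-scaling steps described above can be sketched in Python as follows. This is an illustrative reading of the text, not the authors' code: the function names, the use of the population standard deviation, clipping on both sides of the mean, and a linear min-max map onto $[-1,1]$ are all assumptions, and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def attenuate(values, n_sd=2.0):
    """Clip values to within n_sd standard deviations of the mean."""
    mu, sd = mean(values), pstdev(values)  # population SD: an assumption
    lo, hi = mu - n_sd * sd, mu + n_sd * sd
    return [min(max(v, lo), hi) for v in values]

def rescale(values, higher_is_better=True):
    """Map values linearly onto [-1, 1]; flip the sign for unfavorable indicators.

    Assumes the values are not all equal (otherwise the range is zero).
    """
    lo, hi = min(values), max(values)
    scaled = [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]
    return scaled if higher_is_better else [-s for s in scaled]

# Hypothetical indicator with one extreme outlier.
raw = [1.0] * 9 + [100.0]
clipped = attenuate(raw)             # the outlier is pulled in to mean + 2 SD
wealth_like = rescale(clipped)       # higher is better, e.g. GDP per capita
mortality_like = rescale([1.0, 2.0, 3.0], higher_is_better=False)
print(mortality_like[0])  # 1.0, i.e. the lowest mortality is most favorable
```

Clipping before re-scaling keeps a single extreme value from compressing every other country into a narrow band near -1.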

For our first experiment, we interpret each set of countries as a point cloud with each indicator value as a dimension. We then apply the Euclidean metric to define the distance between two countries $x$ and $y$ over a set of indicators $I$:

$$d_{I}:\mathbb{R}^{|I|}\times\mathbb{R}^{|I|}\to\mathbb{R},\qquad d_{I}(x,y)=\sqrt{\sum_{i\in I}(x_{i}-y_{i})^{2}}$$
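As a quick numerical check of the Euclidean distance over $I=\{$GDP, LE$\}$, the Python sketch below uses the scaled values from Table 3. The function name `d_I` mirrors the notation in the text. The inputs are rounded to two decimals in the table, so the results are approximate.

```python
from math import dist

# Scaled (GDP, LE) coordinates copied from Table 3.
chile = (-0.29, 0.71)
peru = (-0.63, 0.72)
bolivia = (-0.81, 0.37)

def d_I(x, y):
    """Euclidean distance between two countries' scaled indicator vectors."""
    return dist(x, y)

print(round(d_I(chile, peru), 2))     # 0.34
print(round(d_I(chile, bolivia), 2))  # 0.62
```

These two values agree with the endpoints of the interval $[0.34,0.62)$ quoted in the Table 3 caption. The source does not say which edges determine that interval, so this is an observation rather than a claim about the computation.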

We use TDA to construct a stream of VR complexes from these point clouds over a range of filtration values $\epsilon\in[0,1.0]$. Fig. 2 shows the zero-order and first-order barcodes of the VR streams for the two sets of indicators ($\mathbb{R}^{2}$ on the left and $\mathbb{R}^{4}$ on the right).

For the second experiment, we add the geographic structure to the data by constructing a weighted graph over the countries and their borders. From country border data [geonames], we define an adjacency matrix $A$

$$A_{i,j}=\begin{cases}1&\text{if countries }i,j\text{ share a border},\\0&\text{if countries }i,j\text{ do not share a border}\end{cases}$$

from which we arrive at the distance matrix $D_{I}$ for a set of indicators $I$,

$$(D_{I})_{i,j}=\begin{cases}d_{I}(i,j)&\text{if }A_{i,j}=1,\\\infty&\text{if }A_{i,j}=0\end{cases}$$

where, for practicality, infinity is set to be a number larger than the maximum filtration value. This maximum is chosen to be large enough to display the entire set of intervals. We then compute the persistent homology of the explicit metric space defined by $D_{I}$.⁶ The zero-order and first-order persistent homology barcodes for the weighted graphs over $\mathbb{R}^{2}$ data and $\mathbb{R}^{4}$ data are shown in Fig. 3. In this framework incorporating geographic structure, our focus is on the first-order features.

⁶ It has been observed that, for the VR complex, the “metric” in question need not actually be a metric, as it is not required to satisfy the triangle inequality [JavaPlexTutorial]. The construction described here is also known as a weighted rank clique complex. For example, see [DBLP:journals/corr/StolzHP16].

Generally, longer intervals are construed to represent more significant homology classes, while short intervals are noise in the data. Statistically significant intervals can be quantitatively determined by the methods presented in [fasy2014]. However, we shall see that even relatively short intervals in the first-order barcode reveal interesting patterns in the development indicators. On the other hand, intervals in Fig. 3 that persist through the full range of the filtration are less interesting to us, as they relate to the inherent border graph structure. These “infinite” intervals in the dimension-0 barcode indicate island nations that share no borders with other countries. Since their distance to all other countries is infinite, they remain distinct components in the VR complex. The infinite intervals in the dimension-1 barcodes indicate homology classes inherent to the country border graph. Three infinite intervals in Fig. 3 identify the Black, Caspian, and Mediterranean seas. The $\mathbb{R}^{4}$ barcode in Fig. 3 has two additional infinite intervals that exist because two countries (South Sudan and Zimbabwe) were dropped from the data set as not all four indicators were present, creating holes in the graph not unlike an inland sea. That these features are identified is a good sanity check for the method.
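The border-weighted distance matrix can be sketched in Python as below. This is a hedged illustration rather than the authors' R code: the function name `border_distance_matrix`, the zero diagonal, and the stand-in value `big = 10.0` for infinity are assumptions. The coordinates are the scaled (GDP, LE) values from Table 3. Chile and Peru share a border, as do Peru and Brazil, whereas Chile and Brazil do not.

```python
from math import dist

def border_distance_matrix(vectors, borders, big):
    """Indicator distance for bordering pairs, `big` (standing in for infinity) otherwise."""
    n = len(vectors)
    D = [[0.0 if i == j else big for j in range(n)] for i in range(n)]
    for i, j in borders:
        D[i][j] = D[j][i] = dist(vectors[i], vectors[j])
    return D

names = ["Chile", "Peru", "Brazil"]
vectors = [(-0.29, 0.71), (-0.63, 0.72), (-0.52, 0.43)]  # from Table 3
borders = [(0, 1), (1, 2)]  # Chile-Peru and Peru-Brazil
big = 10.0                  # assumed to exceed the maximum filtration value

D = border_distance_matrix(vectors, borders, big)
print(round(D[0][1], 2))  # 0.34
print(D[0][2])            # 10.0
```

Because non-bordering pairs never fall within any filtration value, an edge can only appear in the complex between countries that share a border, however similar their indicators are.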

4 Parsing the Barcodes

4.1 Clustering of Development Groups

Zero-order persistent homology can be viewed as a clustering algorithm, where the connected components of a simplicial complex represent clusters in the data. In fact, these components are equivalent to the clusters of the hierarchical method of single-linkage clustering. In Fig. 1, we see a clustering chosen by Gapminder. In this section, we describe the clusters found using zero-order persistent homology present in the barcode of Fig. 2, focusing on the first experiment, which relies only on distances between indicator values and does not incorporate the country border information. In Appendix B, we present clusters selected by the classic $K$-means algorithm. Each of these methods results in different clusters. However, we observe that viewing clusters at multiple scales and adding more indicators provides additional insight into relations among countries in terms of health and wealth.

We examine the clusters found using dimension-0 persistent homology by extracting the elements in each component of the simplicial complex for a particular filtration value, see the top row of Fig. 2. One may imagine drawing a vertical slice through the dimension-0 barcode at a given $\epsilon$ to select the components. We then extract the list of countries comprising each component using a union-find algorithm. The Betti number can be viewed as a function of the filtration value, $\beta_{k}(\epsilon)$. When $\epsilon=0$, each country is an isolated point, and hence $\beta_{0}(0)=194$ for the $\mathbb{R}^{2}$ data and $\beta_{0}(0)=179$ for the $\mathbb{R}^{4}$ data. All countries in the point cloud eventually merge into a single connected component. This occurs at approximately $\epsilon=0.45$ for the $\mathbb{R}^{2}$ data and $\epsilon=0.92$ for the $\mathbb{R}^{4}$ data, as seen in the barcodes where only one bar remains.

Fig. 4 and Fig. 5 display the six⁷ components that contain the largest number of countries at a variety of filtration scales for the $\mathbb{R}^{2}$ and $\mathbb{R}^{4}$ data, respectively. We further inspect these components in detail below.

⁷ The choice of six is to coincide with the six clusters in the Gapminder project, see Fig. 1.

First, we consider the large-scale structure of the data. For the $\mathbb{R}^{2}$ point cloud, there are 170 countries in a single connected component at $\epsilon=0.14$, eight countries in the next largest, and the remaining countries isolated in small components. We may say this large cluster is the dominant feature of the data. The $\mathbb{R}^{4}$ point cloud shows the same behavior. Fig. 4 and Fig. 5 show how quickly this dominant component grows at early filtration values. At no point do we observe two dominant clusters capturing a combined majority of countries.

Thus, the dimension-0 clustering shows that countries of the world may not be neatly divided into “first world” and “third world” categories with this method.⁸ The vast majority of countries are statistically quite similar to another country, which itself is similar to some other country, and so on. The result is a gradient in health and wealth statistics, rather than a discrete grouping. This is easily visualized in the Gapminder chart of Fig. 1. One sees the countries of the world arrayed along a gradient from poorer countries with less longevity to richer, longer-living countries. Persistent homology clustering captures this gradient, as the resulting clusters connect points to their nearest neighbors, which each connect to their nearest neighbors, and so on. This may result in long clusters whose elements at opposite ends may be quite different from one another but are connected through their neighbors.

⁸ The clustering presented in Appendix B results in different clusters, which more closely align with this simplistic notion.

We also examine the small-scale structure by looking at smaller $\epsilon$ cross-sections. Fig. 4 and Fig. 5 show a sampling of clusters for early filtration values, before most countries are joined into one dominant cluster. Consider the clusters in $\mathbb{R}^{2}$ at $\epsilon=0.08$, shown in Fig. 4 and detailed in Table 2. While most countries fall into connected components of one to four countries, there are six larger components that capture 138 countries. Because these clusters only exist at a small scale, the countries in each cluster must be quite close in the data. Hence, we may conceive of these groups as sets of very similar countries according to the indicators. This clustering makes a distinction between groups of countries with varying GDP/capita and similar life expectancy. Observe that clusters 2-4 (numbered in the row order of Table 2) have similar life expectancy but a wide range of increasing GDP. Likewise, clusters 5 and 6 have almost the same LE but a 0.4 gap in GDP. From this result we may conclude that there is nuance in development among poor countries that may be obfuscated by the “third-world” identifier.
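The slice-and-extract step can be sketched with a small union-find in Python. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `components_at` is hypothetical, and the five points are the scaled (GDP, LE) values from Table 3, used only as a tiny stand-in for the full 194-country point cloud.

```python
from math import dist

def components_at(points, eps):
    """Connected components (lists of indices) of the VR complex at scale eps."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the components of every pair of points within distance eps.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values(), key=len, reverse=True)

names = ["Chile", "Peru", "Bolivia", "Brazil", "Argentina"]
points = [(-0.29, 0.71), (-0.63, 0.72), (-0.81, 0.37), (-0.52, 0.43), (-0.45, 0.55)]

at_02 = components_at(points, 0.2)
at_03 = components_at(points, 0.3)
print([names[i] for i in at_02[0]])  # ['Brazil', 'Argentina']
print(len(at_02), len(at_03))        # 4 1
```

The number of components returned at a given scale is $\beta_{0}(\epsilon)$ for this toy cloud, so the drop from four components to one mirrors bars ending in the dimension-0 barcode.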

Bibliography

[1] The geonames geographical database. http://www.geonames.org. Accessed: 2017-01-28.
[2] Henry Adams and Andrew Tausz. Javaplex tutorial. http://appliedtopology.github.io/javaplex/, 2017.
[3] Gunnar Carlsson. Topology and data. Bulletin of the American Mathematical Society, 46(2):255–308, 2009.
[4] Moo K. Chung, Peter Bubenik, and Peter T. Kim. Persistence diagrams of cortical surface data. In Information Processing in Medical Imaging, pages 386–397. Springer, 2009.
[5] Yu Dabaghian, Facundo Memoli, Loren Frank, and Gunnar Carlsson. A topological paradigm for hippocampal spatial map formation using persistent homology. PLoS Computational Biology, 8(8):e1002581, 2012.
[6] Herbert Edelsbrunner and John Harer. Persistent homology – a survey. Contemporary Mathematics, 453:257–282, 2008.
[7] Herbert Edelsbrunner and John Harer. Computational topology: An introduction. American Mathematical Society, 2010.
[8] Brittany T. Fasy, Jisu Kim, Fabrizio Lecci, Clement Maria, Vincent Rouvreau. The included GUDHI is authored by Clement Maria, Dionysus by Dmitriy Morozov, PHAT by Ulrich Bauer, Michael Kerber, and Jan Reininghaus. TDA: Statistical Tools for Topological Data Analysis, 2017. R package version 1.5.1.