High-resolution home location prediction from tweets using deep learning with dynamic structure

Meysam Ghaffari; Ashok Srinivasan; Xiuwen Liu

arXiv:1902.03111·cs.SI·July 9, 2019

TL;DR

This paper presents a deep learning approach with a dynamic two-phase structure to accurately predict high-resolution home locations from social media tweets, significantly outperforming previous methods in accuracy and error reduction.

Contribution

The paper introduces a novel two-phase deep learning framework combining random forests and neural networks for high-resolution home location prediction from social media data.

Findings

1. Achieved over 90% accuracy for large subsets of users on a commonly used dataset.

2. Reduced high-resolution prediction error from over 21% to less than 8%.

3. Outperformed existing high-resolution methods in accuracy, both for the entire sample and for subsets.

Tables

Table I: The configuration of DNN-R and DNN-C

DNN-R Configuration	DNN-C Configuration
Input dimension: 10	Input dimension: 10
Dense1: Input: 10, output: 5, Activation: ReLU	Dense1: Input: 10, output: 5, Activation: ReLU
Dropout1: 0.30 dropout rate	Dropout1: 0.20 dropout rate
Dense2: output: 20, Activation: ReLU	Dense2: output: 20, Activation: ReLU
Dropout2: 0.30 dropout rate	Dropout2: 0.20 dropout rate
Dense3: Input: 20, output: 5, Activation: ReLU	Dense3: Input: 20, output: 5, Activation: ReLU
Dropout3: 0.30 dropout rate	Dropout3: 0.20 dropout rate
Dense4: output: 5, Activation: ReLU	Dense4: output: 5, Activation: ReLU
Dropout4: 0.30 dropout rate	Dropout4: 0.20 dropout rate
Dense5: Output: 1, Activation: Sigmoid	Dense5: Output: 2, Activation: Sigmoid

Table II: Dataset description

Feature	Description
Check-in ratio	The ratio of the number of check-ins in a specific location by a user to the total number of check-ins at all locations by that user.
Daily total check-in rate	The average daily number of check-ins by the user.
End of day ratio	The ratio of the number of last check-ins of the day (between 5PM and 3AM) at a specific location to the same for all locations.
End of inactive day ratio	The ratio of the number of last check-ins of each weekend day (between 5PM and 3AM) at a specific location to the same for all locations.
Distance from most check-in location	The distance of a specific location from the most visited location by that user.
Midnight ratio	The ratio of the number of check-ins at a specific location between 12AM-7AM by a user to all check-ins during 12AM-7AM by that user.
Number of check-ins at this location	Number of check-ins at this location by the user.
Total number of user check-ins	Total number of check-ins by the user.
Page rank	A graph measure of the importance of each location. A node in the graph represents a location, and the weight of a directed edge from u to v gives the number of times a user went from location u to location v. This measure considers the sequence of visited locations until 3AM of each day.
Reverse page rank	Is similar to page rank, but swapping the source and destinations.
User-ID	The unique ID assigned to each user in order to preserve privacy.
Is-home	Whether or not a record corresponds to the user’s home.

Table III: Comparison of the new method and state-of-the-art methods

Method	Reported accuracy in % (100m resolution)	Description
Hu et al. [17]: SVM	70.00	For subsets of 76% and 71% of two different datasets
Kavak et al. [9]: using DBSCAN and SVM	79.50	Best reported accuracy on the whole population among prior methods. They use the same dataset as we do.
Tasse et al. [15]: using multilevel DBSCAN and grid search	56.90	Reported result for 100m resolution. They also obtained 79% for 1 km resolution.
Our model (DNN-R)	83.40	Best achieved accuracy for the whole dataset
Our model (DNN-R + DNN-C)	85.10	Reported results for 80% of the users
Our model (DNN-R + DNN-C)	91.86	Reported results for 30% of the users
Our model (DNN-R + DNN-C)	92.60	Reported results for 10% of the users

Table IV: The results of running the random forest and DNN-R separately

Method	Reported accuracy (100m resolution)	Training time
Random forest	79.38%	719s
DNN-R	80.04%	9,485s
DNN-C	72%	54,000s
DNN-R and DNN-C	80.21%-83.82%	63,485s
Random forest, DNN-R and DNN-C	84%-92.6%	26,244s

Table V: Number of first- and second-generation Puerto Ricans living in each PUMA zone. The Red Zones are in bold.

PUMA code	Close Neighborhood	Puerto Rican Generations: First + Second	Sum
08607	North East Airport	2814 + 1322	4136
08601	Miami Lakes	3320 + 752	4072
08603	North Miami Beach	3470 + 495	3965
08605	North Miami, Golden Glades	3760 + 87	3847
08616	West Miami	1774 + 2069	3843
08614	Key Biscayne	2623 + 926	3549
08613	Downtown	2887 + 378	3274
08611	Wynwood	3034 + 87	3121
08610	Miami Springs, Virginia Gardens	2798 + 230	3028
08602	Miami Gardens	2676 + 0	2676
08615	Coral Gables	2018 + 394	2412
08606	West Little River	2054 + 305	2359
08612	Miami Beach	1954 + 212	2166
08604	North Beach, Bal Harbour	2003 + 0	2003
08608	Air Port	1716 + 286	2002
08609	Hialeah	785 + 485	1270

Table VI: Detected neighborhoods for Miami residents who visited Puerto Rico

Neighborhood	Percentage of users
Downtown	25%
Miami Beach	20%
Wynwood	10%
Miami Airport	10%
Allapattah	10%

Taxonomy

Topics: Human Mobility and Location-Based Analysis · Data-Driven Disease Surveillance · Geographic Information Systems Studies

Full text

This material is based upon work supported by the National Science Foundation under Grant No. 1640822.

Meysam Ghaffari

Dept. of Computer Science

Florida State University

Tallahassee, US

Ashok Srinivasan

Dept. of Computer Science

University of West Florida

Pensacola, US

Xiuwen Liu

Dept. of Computer Science

Florida State University

Tallahassee, US

Abstract

Timely and high-resolution estimates of the home locations of a sufficiently large subset of the population are critical for effective disaster response and public health intervention, but this is still an open problem. Conventional data sources, such as census and surveys, have a substantial time lag and cannot capture seasonal trends. Recently, social media data has been exploited to address this problem by leveraging its large user-base and real-time nature. However, inherent sparsity and noise, along with large estimation uncertainty in home locations, have limited their effectiveness. Consequently, much of previous research has aimed only at a coarse spatial resolution, with accuracy being limited for high-resolution methods. In this paper, we develop a deep-learning solution that uses a two-phase dynamic structure to deal with sparse and noisy social media data. In the first phase, high recall is achieved using a random forest, producing more balanced home location candidates. Then two deep neural networks are used to detect home locations with high accuracy. We obtained over 90% accuracy for large subsets on a commonly used dataset. Compared to other high-resolution methods, our approach yields up to 60% error reduction by reducing high-resolution home prediction error from over 21% to less than 8%. Systematic comparisons show that our method gives the highest accuracy both for the entire sample and for subsets. Evaluation on a real-world public health problem further validates the effectiveness of our approach.

Index Terms:

deep neural network, dynamic structure, random forest, home location prediction, Twitter analysis, epidemics

I Introduction

Applications in diverse domains, including agriculture, transportation, poverty reduction, conflict prevention, disaster response, and humanitarian aid, require knowledge of the distribution of home locations of the population, or of specific demographic sub-groups, for effective public policy interventions [1, 2, 3]. The conventional approach in these fields is to use census data or data from surveys, such as the American Community Survey (ACS). However, these are conducted too infrequently to provide timely information. Moreover, ACS data has a resolution of a zone containing 100,000 people, which is too coarse for critical applications explained later.

New data sources, such as cell phone data records and GPS information, have been considered as alternative sources [1, 2]. However, the use of cell phone data has strict regulatory constraints and is not widely accessible. Moreover, its granularity is limited by the closest base transceiver station antennas to the user.

Social media data can potentially address the spatial and temporal challenges. Social media posts often carry high-accuracy geotags from the device GPS as metadata. Furthermore, social media has wide popularity. For example, Twitter has over 300 million active users worldwide. The use of social media is also increasing, with the number of tweets per day at 500 million in June 2018, in contrast to 400 million in March 2013 [4]. This offers the potential of obtaining real-time information on a large population sample with high spatial resolution.

These observations motivate the problem addressed in this paper. Given metadata for a large number of tweets, we wish to find home locations with 100m resolution for a subset of users, with high accuracy in the prediction.

Note that the American Community Survey, conducted by the US Census Bureau, has a sample size of around 2 million for each of the last few years. With approximately 67 million active Twitter users in the US, the ability to predict accurately for even 10% of the users would provide us with a sample that is several times that of the American Community Survey. Besides, Twitter would deliver results in real-time, in contrast to the annual reports published by the latter.

Despite the promise of Twitter data, there are also significant challenges arising from incorrect, imprecise, or missing information. In particular, the home location in the Twitter profile is optional. Hecht et al. have determined that only $42\%$ of the Twitter users report a valid city on their Twitter profile [18]. Furthermore, users often provide home location at the city level, which is not sufficiently precise for the class of applications that we consider. As an alternative, others have considered inferring home location from users’ check-in activities using the geotags of tweets. The challenge here is that users tweet at multiple locations, which makes it hard to pinpoint the precise user home location out of several locations that they may visit.

Given the above challenges, much of prior research has focused on predicting home location at the state and city levels. The few papers in the literature dealing with high-resolution predictions use Support Vector Machines (SVM) with a linear kernel for prediction, obtaining 70% accuracy for a 76% subset of the test population with 100m resolution. The highly imbalanced and complex nature of the data limits the efficacy of such an approach.

In this paper, we use a two-phase dynamic structure to manage the highly imbalanced and complex data effectively. In the first phase, we use a random forest designed to yield high recall to produce a more balanced set of records containing home candidates. In the second phase, using the more balanced sample, we train two different deep neural network models: Deep Neural Network for Regression (DNN-R) and Deep Neural Network for Classification (DNN-C). DNN-R is responsible for detecting user home location among available location records, and DNN-C is used for either approving or rejecting the detected record as the user home location, thus controlling the subset of data for which we provide a predicted home location.

By using a fast random forest algorithm to remove the majority of non-home records and then using more precise methods to improve accuracy, our approach yields the highest accuracy for high-resolution home location prediction from Twitter data for both the entire sample and for its subsets, obtaining up to 92.6% for a 10% subset and achieving up to 60% prediction error reduction in comparison to other methods. In addition, as an application, we used the proposed method to detect high-risk neighborhoods for the 2016 Zika epidemic importation from Puerto Rico to Florida and showed that it was substantially more effective than conventional ACS data [31].

The rest of the paper is organized as follows. We discuss related work on home location prediction and its applications in Section II. We then describe our deep learning model in Section III and analyze its performance empirically in Section IV. We also demonstrate its effectiveness in detecting high-risk neighborhoods for Zika importation, compared with ACS data. We summarize our conclusions in Section V and present directions for future work.

II Related Work

As mentioned above, several applications need high-resolution home location [1, 2, 3]. The motivating application for us is the spread of vector-borne diseases such as Zika, Dengue, and Chikungunya. These are spread by mosquitoes of the Aedes genus which have a range of 100m-200m. If one can identify the home locations of people who recently visited regions experiencing an outbreak at such a granularity, then mosquito control measures can be cost-effectively deployed in those locations to reduce the likelihood of local disease spread [25]. It is sufficient for the mathematical models in these applications if we generate home locations of a sample of the population, provided it is large enough to capture the distribution of demographic groups of interest. As long as the sample size is sufficiently large, the primary goal is maximizing the accuracy [19].

Cell phone call data records (CDR) and GPS data [1, 11] have been used for home location prediction. However, they are not widely used due to limited resolution, regulatory constraints, and high cost. In comparison, there has been much recent interest in leveraging the abundance of social media data to predict users’ home locations. For example, Backstrom et al. used Facebook users’ friends to predict their home locations with an accuracy of $69.1\%$ within a range of 25 miles [15]. Most research related to home location prediction from tweets has focused on coarse spatial scales, such as time zone, state, and city. Mahmud et al. predicted user home location based on tweet geotags at the city, state, and time zone levels with accuracies of $58\%$, $66\%$, and $78\%$, respectively [21]. Cheng et al. predicted the home location of users within 100 miles of their home with $51\%$ accuracy [22]. Pontes et al. also used Twitter geotags to detect user home locations at the city level with an accuracy of $82\%$ [6]. However, until recently, few researchers tried to predict at the fine granularity that we target.

In recent work, Tasse et al. focused on predicting home locations at finer spatial scales than in prior work [15]. They predicted the home location at a resolution of 1 km with $79\%$ accuracy and within 100m with $56\%$ accuracy. Hu et al. extracted a few features of users’ check-in patterns and improved the accuracy of home location prediction to $70\%$ for a $76\%$ subset of the data using a Support Vector Classifier (SVC) with a linear kernel [17]. Kavak et al. defined two additional features for users’ check-in patterns. They applied DBSCAN, a density-based clustering algorithm, to extract tweet locations for each user. Tweets with spatially close geotags from the same user are assigned to the same cluster [34]. A feature vector corresponds to each cluster, as shown in Table II. They then applied an SVM with a linear kernel and trained and tested the model using 5-fold cross-validation. Their best result was $79.5\%$ accuracy for predicting a user’s home location within 100m of the true home [9]. Since this is the best reported accuracy, and they made their dataset publicly available, we also evaluated our work on this dataset.

III Proposed Method

Tweets by a user may be from several locations. In our approach, we select the one record that indicates the user’s home location out of multiple records that indicate places the user visited. In this problem, the user’s home record is the minority class and the other places the user visited form the majority class. The disparity in the sizes of the two classes makes the problem unbalanced, which is exacerbated by the complexity of the data arising from travel pattern variations in the Twitter geotag dataset [28]. Dynamic structures are useful in problems with complex, unbalanced datasets, where it is not easy to detect the minority class [24]. We use a two-phase model. The first phase uses a simple algorithm with high recall, which runs fast and eliminates a significant number of records in the majority class to make the data for the next phase more balanced. The second phase involves an effective but time-consuming algorithm to detect the minority class records precisely. Figure 1 shows the main components of the proposed method.

As shown in Figure 1, we first normalize the features. We then divide the dataset into a training set and a test set, and we train a random forest on the training set so that it can predict on the test set. We used 5-fold cross-validation, so that in each fold $80\%$ of the data is used for training and $20\%$ for testing. After performing this phase with high recall, we obtain a significantly smaller dataset. The high recall ensures that we don’t miss many true home locations, while eliminating many records that are not the home. In the second phase, we again divide the newly created smaller dataset into training and test sets and train two different DNN models on the training set. The first DNN is a regression model (DNN-R) and the second DNN is a classification model (DNN-C). After training, DNN-R selects one record for each user that corresponds to the predicted home. This result can be used to obtain the precision for the entire test population. However, our focus is on identifying home locations of a subset of the population with high accuracy. We additionally use DNN-C to accomplish this as follows. DNN-R sends records detected as home locations to DNN-C. DNN-C classifies each record as the home with a certain probability. If this probability exceeds a threshold, then the record is reported as the user’s home. Otherwise, that user’s home is reported as unknown. We provide further details of each step below.

III-A Feature Normalization

Feature normalization is a standard step that ensures that all features are considered equally in the learning algorithm. It is accomplished by computing $X^{\prime}=2\times\frac{X-X_{min}}{X_{max}-X_{min}}-1$ for each feature $X$, where $X_{min}$ and $X_{max}$ are the minimum and maximum values respectively of that feature, and $X^{\prime}$ is the normalized value, which lies in $[-1, 1]$. Feature normalization helps the algorithm to better model the dataset, and prevents bias toward one feature with high values. After normalization, we perform the first phase of our learning algorithm using a random forest.
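The normalization formula can be illustrated with a minimal Python sketch. The function name and the handling of a constant feature (where the formula would divide by zero) are our assumptions and are not specified in the paper.

```python
def normalize_feature(values):
    """Rescale one feature column to [-1, 1] via X' = 2*(X - min)/(max - min) - 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature is mapped to 0.0, since the formula is undefined here.
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


# Hypothetical check-in counts for four locations of one user.
print(normalize_feature([1, 3, 5, 9]))  # [-1.0, -0.5, 0.0, 1.0]
```

The smallest value maps to -1 and the largest to 1, so features with very different raw ranges end up on the same scale.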

III-B Random Forest

Random forests are based on decision trees. A decision tree is a supervised classifier which gets a set of data and creates a tree-like model of rules for classifying the data. The rules are based on the features of the training data [32]. In a random forest, instead of creating one tree to classify the dataset, several trees are created. Each tree uses just a subset of features, and also a subset of training data is used for training the model. In typical use, the majority of trees define the class of each record. But we differ from this as explained later. The biggest benefit of random forest over the decision trees is that it works on a bootstrap dataset with randomly selected features for each tree, and thus tries to prevent overfitting. The bootstrap dataset is created using randomly selected records of the dataset with replacement, so that it contains the same number of records as the original dataset. Some records may be selected multiple times due to this form of selection. The random forest has an error rate comparable to AdaBoost [29], but at the same time is more robust with respect to noise. Moreover, it is proven to work well on imbalanced data [8].

In the first phase of our model, we aim to eliminate non-home location records as much as possible. For this purpose, we use a random forest to classify the records as home or not home records. In order to have a high recall, we select every record that any tree in the forest predicted as the home location, rather than use the typical majority decision. We send the selected records in this phase to the second phase.
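To illustrate how the "any tree" rule differs from the usual majority vote, the following sketch operates on a hypothetical matrix of per-tree votes. It does not train a forest; the vote matrix and function names are assumptions made for illustration only.

```python
def select_any_tree(tree_votes):
    """Keep record i if at least one tree labels it 'home' (high-recall rule)."""
    return [i for i, votes in enumerate(tree_votes) if any(votes)]


def select_majority(tree_votes):
    """Keep record i only if more than half of the trees label it 'home'."""
    return [i for i, votes in enumerate(tree_votes) if 2 * sum(votes) > len(votes)]


# Hypothetical votes: rows are records, columns are three trees.
votes = [
    [True, True, False],    # most trees say 'home'
    [False, True, False],   # a single tree says 'home'
    [False, False, False],  # no tree says 'home'
]
print(select_any_tree(votes))   # [0, 1]
print(select_majority(votes))   # [0]
```

The second record would be discarded by a majority vote but survives the any-tree rule, which is how the first phase trades precision for recall.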

III-C Deep Neural Network for Regression (DNN-R)

In the second phase of the algorithm, we designed and applied a deep learning model, a multilayer perceptron. The configuration of the sequential fully connected deep neural network that we used is shown in Table I. This model has five dense layers and takes input data with 10 features. In order to include non-linearity in the model, we used Rectified Linear Unit (ReLU) activation functions in the first four layers [10] and a sigmoid in the last layer.

In order to prevent overfitting and to improve the generalization of the model, we applied dropout layers after each of the first four dense layers. Dropout randomly sets the outputs of some neurons to 0 with a predefined probability, thus preventing overfitting [26]. We used a regression model with a Stochastic Gradient Descent (SGD) optimizer and a mean squared error loss function. SGD is an iterative method for finding the optimum point of differentiable functions. We then trained this model on the records selected in the first phase to detect the user home location, where the home record has a target value of 1 and other records have a target value of 0. After training this model on the training set, we used it on the test set. For each user, the record with the highest prediction value is considered to be the home location.
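The final selection step, choosing the highest-scoring record per user, can be sketched as follows. The record layout `(user_id, location_id, score)` and the function name are illustrative assumptions; the scores stand in for DNN-R outputs.

```python
def pick_home_per_user(records):
    """records: iterable of (user_id, location_id, score).
    Returns {user_id: location_id} for the highest-scoring record of each user."""
    best = {}
    for user_id, location_id, score in records:
        if user_id not in best or score > best[user_id][1]:
            best[user_id] = (location_id, score)
    return {user: loc for user, (loc, _) in best.items()}


# Hypothetical DNN-R scores for two users.
scored = [("u1", "a", 0.2), ("u1", "b", 0.9), ("u2", "c", 0.4)]
print(pick_home_per_user(scored))  # {'u1': 'b', 'u2': 'c'}
```

Because exactly one record is chosen per user, this step always produces a prediction for every user, regardless of how low the best score is.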

III-D Deep Neural Network for Classification (DNN-C)

The second deep learning model has a similar configuration with two differences, as seen from Table I. Instead of regression, it is designed for classification. So, we used the categorical cross-entropy as the loss function and the ‘RMSprop’ optimizer [30]. For each weight, this optimizer divides the learning rate by a running average of the magnitudes of the recent gradients for that weight [27]. Furthermore, since this is a classification algorithm with two classes, the last layer has two outputs, one for each class.
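As a rough illustration of the RMSprop update described above, the sketch below applies the commonly used form of the rule to a single scalar weight. The hyperparameter values (`lr`, `rho`, `eps`) and the toy objective are assumptions; the paper does not report the settings it used.

```python
import math


def rmsprop_step(weight, grad, avg_sq, lr=0.001, rho=0.9, eps=1e-7):
    """One RMSprop update for a scalar weight.
    avg_sq is the running average of squared gradients for this weight."""
    avg_sq = rho * avg_sq + (1.0 - rho) * grad * grad
    weight = weight - lr * grad / (math.sqrt(avg_sq) + eps)
    return weight, avg_sq


# Toy example (assumption): minimize f(w) = w^2, whose gradient is 2w.
w, avg = 1.0, 0.0
for _ in range(200):
    w, avg = rmsprop_step(w, 2.0 * w, avg, lr=0.01)
print(abs(w) < 1.0)  # the weight has moved toward the minimum at 0
```

Dividing by the root of the running average makes the effective step size roughly independent of the gradient scale for each weight.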

In the last phase, by comparing the prediction value of DNN-C with a threshold, we verify the results of the DNN-R. The result will be reported only when the prediction value is higher than the threshold. Thus, instead of predicting the home location of all users, we predict the home location for a subset of users, but with higher accuracy.
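The verification step can be sketched as a simple gate on the DNN-C score. The data layout, the function name, and the use of `None` for "unknown" are illustrative assumptions.

```python
def gate_predictions(candidates, threshold):
    """candidates: {user_id: (location_id, dnn_c_score)}.
    Returns {user_id: location_id}, or None where the home is reported as unknown."""
    return {
        user: (loc if score > threshold else None)
        for user, (loc, score) in candidates.items()
    }


# Hypothetical DNN-R selections with DNN-C scores.
selected = {"u1": ("b", 0.97), "u2": ("c", 0.55)}
print(gate_predictions(selected, threshold=0.9))  # {'u1': 'b', 'u2': None}
```

Raising the threshold reports fewer users, which is the mechanism behind the accuracy versus coverage trade-off.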

IV Results and Analysis

IV-A Description of Data Set

We used a well-curated dataset prepared by Kavak et al. [9]. Their data was gathered using Twitter streaming API from May 2014 to April 2015 for the city of Chicago, Illinois. They performed anonymization to preserve the privacy of the users and then ran DBSCAN to cluster together tweets that are in close proximity, with the distance range specified as 100m. Each record in the final database relates to tweets from a particular user at a particular location with a 100m spatial resolution. For validation purposes, the true home location was determined by obtaining confirmation from the users about whether they tweeted from home. The final dataset has 78,812 records for 1268 users. The features of this dataset are listed in Table II [9].

IV-B Experimental Setup

We used the same dataset and test procedure as [9] in order to ensure a fair comparison. As in the state of the art, we used 5-fold cross-validation in both phases of our model, which means that the model is trained using $80\%$ of the dataset and validated using the remaining $20\%$ in each of five experiments. Both DNN phases use the same training data in each test, and likewise the same test data. Note that records are assigned to these sets based on the user. Consequently, all records for a specific user go either into the training set or into the test set in any single experiment.
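A user-level split of this kind can be sketched as below. The deterministic round-robin assignment of users to folds is our assumption (the paper does not state how users were assigned), as are the function names and the record layout.

```python
def user_folds(user_ids, k=5):
    """Assign each distinct user to one of k folds; returns {user_id: fold_index}.
    Assumption: round-robin over sorted user IDs, chosen only for reproducibility."""
    unique_users = sorted(set(user_ids))
    return {user: i % k for i, user in enumerate(unique_users)}


def split_by_fold(records, fold_of_user, test_fold):
    """records: list of tuples whose first element is the user ID.
    Returns (train, test) with every user's records on exactly one side."""
    train = [r for r in records if fold_of_user[r[0]] != test_fold]
    test = [r for r in records if fold_of_user[r[0]] == test_fold]
    return train, test


# Hypothetical records: (user_id, feature_value).
data = [("u1", 0.1), ("u1", 0.2), ("u2", 0.3), ("u3", 0.4), ("u3", 0.5)]
folds = user_folds([r[0] for r in data], k=3)
train, test = split_by_fold(data, folds, test_fold=0)
print(len(train), len(test))  # 3 2
```

Splitting by user rather than by record prevents a user's non-home records in the training set from leaking information about that user's home record in the test set.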

IV-C Results of First Phase (Random Forest)

We use a random forest with 500 trees. The random forest assigns each record a prediction value between 0 and 1 for being the user’s home location. This prediction value is the fraction of decision trees that considered that record to be the user’s home. We select a record as a candidate for a user’s home location if its prediction value reaches a predefined threshold. In order to have high recall, we set the threshold to 0.002 in the forest with 500 trees, so that even if one tree in the forest predicts the record as home, we select it for the next phase.

As shown in Figures 2 and 3, selecting a higher threshold significantly decreases the recall, especially when the threshold is close to 1. Since we are looking for high recall, while also pruning the records that are clearly not a user’s home location, we considered a threshold close to zero. As shown in Figure 3, the highest recall is obtained by passing $17.5\%$ of the records to the second phase; selecting fewer records decreases the recall, which does not serve our goal. We therefore used a small threshold, with a record being selected if even one tree selected it. This corresponds to a threshold of 0.002 in a forest with 500 trees. This yields a recall of $95.97\%$ with an average of 10.7 records per user, which is a substantial reduction from the roughly 62 records per user in the initial dataset. The records selected in this phase are sent to the second phase.
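The effect of the threshold on recall and on the share of records retained can be illustrated with a small sketch. The vote fractions and labels below are made up, and the comparison uses `>=` so that a single vote out of 500 (a fraction of 0.002) passes a threshold of 0.002, which is our reading of the rule described above.

```python
def recall_and_retained(vote_fractions, is_home, threshold):
    """vote_fractions[i]: fraction of trees voting 'home' for record i.
    is_home[i]: ground-truth label of record i.
    Returns (recall on home records, fraction of all records kept)."""
    kept = [f >= threshold for f in vote_fractions]
    true_homes = sum(is_home)
    recovered = sum(1 for k, h in zip(kept, is_home) if k and h)
    recall = recovered / true_homes if true_homes else 0.0
    return recall, sum(kept) / len(kept)


# Hypothetical vote fractions for six records, two of which are true homes.
fractions = [0.0, 0.002, 0.5, 0.9, 0.0, 0.004]
labels = [False, True, False, True, False, False]
for t in (0.002, 0.6):
    print(t, recall_and_retained(fractions, labels, t))
```

In this toy data the low threshold keeps both true homes while discarding a third of the records, whereas the high threshold keeps far fewer records but loses one of the two homes.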

IV-D Configuring the DNNs

One of the important configuration choices for the DNN is the dropout rate. As mentioned earlier, dropout is a useful technique for generalization. In order to find the best dropout rate, we tried different values on the DNN-R. Figure 4 shows the effect of the dropout rate on the result. Based on this figure, we chose a dropout rate of $0.30$ for the DNN-R.

As we can see in this figure, increasing the dropout rate initially improves the results and then makes them worse. The reason is that a small dropout rate such as $0.20$ or $0.30$ prevents the algorithm from overfitting. But a high dropout rate such as $0.90$ sets the outputs of most of the neurons to 0, which prevents the algorithm from learning any pattern and makes it behave randomly.
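The following sketch shows why a very high dropout rate leaves little signal to learn from. It uses the common "inverted dropout" formulation, in which surviving activations are rescaled by 1/(1 - rate); whether the authors' framework used exactly this variant is an assumption, as are the function name and the toy input.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; scale the survivors by 1/(1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the illustration is repeatable
for rate in (0.3, 0.9):
    out = dropout([1.0] * 10000, rate, rng)
    zeroed = sum(1 for v in out if v == 0.0) / len(out)
    print(rate, round(zeroed, 2))  # fraction of units dropped, close to `rate`
```

With a rate of 0.9, only about one unit in ten passes anything forward during training, which is consistent with the degraded results reported for high dropout values.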

Choosing an appropriate number of iterations for SGD is important in deep learning. If the number of iterations is too low, the algorithm cannot fit the data well and learn the pattern, while a high number of iterations can cause overfitting and reduce generalization. We tried different numbers of iterations to find the best value for the algorithm. Figure 5 shows the results for different numbers of iterations. Beyond 20 iterations, the results are not very sensitive to the number of iterations, with the best results at 50 iterations. We used 50 iterations in our experiments.

IV-E Results of the Second Phase (DNN-R and DNN-C)

In the second phase, we have two types of results. First, the results based on DNN-R alone show the accuracy of home location prediction for all users in the dataset. Second, the results of DNN-R combined with DNN-C show the accuracy of home location prediction for a subset of users, but with higher accuracy.

After training DNN-R, we apply it to the test set. For each user in the test set, we consider the record with the highest predicted value of the DNN-R model to be the user’s home location. The results of predicting the home location of users are given in Table III and compared with the state-of-the-art results. We can see that our model improves on the prediction for the whole population in the test set over prior methods. However, our primary goal is to obtain high accuracy in a subset of the population, which requires one more step, as described below.

In the final step, we use the record selected as a user’s home location by DNN-R, and confirm whether this is true using DNN-C. DNN-C will provide a predicted score for each record provided to it. If this score exceeds a predefined threshold, then we will report that record as the user’s home. Otherwise, we will report that user’s home as unknown. Table III shows that this approach increased the home location prediction accuracy up to $92.6\%$ on a subset of users, which is significantly higher than the results of DNN-R for the entire test population, which averaged $81\%$ over the 5 tests, with a maximum of $83.4\%$ on one of those tests. Furthermore, this accuracy substantially exceeds those for prior results on subsets of the population. This accuracy is remarkable if we consider that Tasse et al. found that over $10\%$ of Twitter users did not have tweets within a range of 100m of their home location [15]. Consequently, the maximum possible accuracy for the entire population is less than $90\%$ .

Figures 6, 7, and 8 show the effect of the threshold on the accuracy and on the fraction of the population for whom our model can predict the home location. There is a trade-off between accuracy and the fraction of the population for whom we can predict the home location. One can obtain the highest accuracy of around $92.6\%$ using $10\%$ of the total population. However, we can obtain a much larger sample ($30\%$) without a substantial drop in accuracy, maintaining it at over $90\%$. Given the large number of Twitter users, our method can yield a large sample with good accuracy.
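The trade-off can be made concrete with a small sketch that computes accuracy and coverage for a given DNN-C threshold. The scores and correctness flags below are invented for illustration and do not reproduce the paper's figures.

```python
def accuracy_and_coverage(predictions, threshold):
    """predictions: list of (dnn_c_score, is_correct), one entry per user.
    Returns (accuracy among reported users, fraction of users reported)."""
    reported = [correct for score, correct in predictions if score > threshold]
    if not reported:
        return None, 0.0
    return sum(reported) / len(reported), len(reported) / len(predictions)


# Hypothetical per-user scores and whether the DNN-R pick was actually the home.
preds = [(0.95, True), (0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.4, False)]
for t in (0.5, 0.75):
    print(t, accuracy_and_coverage(preds, t))
```

In this toy data, raising the threshold from 0.5 to 0.75 lifts accuracy from 0.8 to 1.0 while the share of users with a reported home falls from five in six to one half.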

As shown in Figure 6, the accuracy increases as the threshold increases, but not at a fixed rate. The reason is that increasing the threshold in DNN-C prunes more records. The pruned records can include both correctly and incorrectly classified ones, and the DNN-C prediction values of these two groups are distributed differently. In general, however, raising the threshold is effective and increases the accuracy to as high as $92.6\%$ .

The same happens in Figure 8. When we prune more records, we obtain higher accuracy for a smaller portion of the users. In this figure, the accuracy reaches $92.6\%$, but only for $10\%$ of the users. To report home locations for more users, we must decrease the threshold, which increases the percentage of users covered and decreases the accuracy for them.

We expect our method to also be effective on other datasets because we have taken steps to ensure robustness and to avoid overfitting. For example, we used dropout in the deep neural networks. Moreover, we used 5-fold cross validation and performed the experiments at least 25 times. Thus, the whole dataset has been tested 5 times. The reported results are the average accuracy from these tests. The individual results did not vary much, and so the average is a reasonable reflection of typical performance.
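One reading of this protocol is 5 repetitions of 5-fold cross validation, which gives 25 training runs and tests every sample 5 times. The sketch below generates such splits using only the standard library; the function `repeated_kfold`, the seed, and the sample count of 20 are assumptions for illustration, not our experimental code.

```python
import random
from collections import Counter
from typing import Iterator, List, Tuple


def repeated_kfold(
    n_samples: int, n_splits: int = 5, n_repeats: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for repeated k-fold cross validation.

    Each repetition reshuffles the data and partitions it into n_splits folds,
    so every sample is tested exactly once per repetition.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        folds = [indices[i::n_splits] for i in range(n_splits)]
        for k in range(n_splits):
            test = folds[k]
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, test


splits = list(repeated_kfold(n_samples=20))
times_tested = Counter(i for _, test in splits for i in test)
print(len(splits), set(times_tested.values()))
```

Averaging the accuracy over all runs produced this way gives a single summary figure, and the spread across runs indicates how much individual results vary.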

IV-F Analysis of Each Component

We next analyze the contribution of specific model components to accuracy and training time. Table IV shows the results of the random forest and DNN-R. As we can see, using the random forest in the first phase decreases the execution time and also improves the final result, as explained earlier. We can also see that each component on its own does not perform as well as the combined model, which shows that the dynamic structure is effective. In particular, the random forest decreases the training time, and the balanced data that it produces enhances the effectiveness of the remaining components.

IV-G Results in Detecting High Risk Neighborhoods in Zika 2016

Here, we discuss the application of our method to a public health problem. We show that our method yields better results than conventional data sources for this application. In 2016, a Zika virus outbreak occurred in Florida through importation from the Caribbean and South America, with Puerto Rico playing a major role. Consequently, several Zika cases were reported in Miami. They were mostly imported, though there was subsequent local spread. In 2016, the CDC announced three Zika red zones in Miami, each of the order of a square mile: Miami Beach, Wynwood, and Little River [31]. The conventional approach to identifying high-risk neighborhoods before an outbreak would look for places in Miami where persons with a connection to Puerto Rico lived. This relies on the assumption that such individuals were more likely than the general population to have visited Puerto Rico, which was experiencing a major outbreak, and to have been exposed to the virus there. ACS data provides the number of persons of Puerto Rican origin living in each Public Use Microdata Area (PUMA). Table V shows these counts based on 2016 data. As we can see, the red zones cannot be easily identified from this data.

Alternatively, we can use our algorithm to find high-risk neighborhoods in Miami as follows. We extracted the tweets of more than 500,000 users using the Twitter API. We wished to generate a sample of users biased toward individuals in Florida with a connection to Puerto Rico. So, we started with a few popular Twitter accounts in Puerto Rico and extracted their followers, the followers’ followers, and so on, and kept those whose profile data indicated a location in Florida. We applied our model, trained on the earlier data set from Chicago, to find the home locations of people who lived in Miami and had visited Puerto Rico. The results are shown in Table VI.
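The sampling procedure amounts to a breadth-first traversal of the follower graph followed by a profile-location filter. The sketch below uses a small in-memory graph in place of calls to the Twitter API; the account names, the depth limit, the functions `crawl_followers` and `in_florida`, and the simple string match for Florida are all hypothetical simplifications, not the procedure we actually ran.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def crawl_followers(
    seeds: Iterable[str], followers_of: Dict[str, List[str]], max_depth: int
) -> Set[str]:
    """Breadth-first expansion from seed accounts to followers, followers of
    followers, and so on, up to max_depth hops. Seeds are not returned."""
    seed_set = set(seeds)
    seen = set(seed_set)
    queue = deque((s, 0) for s in seed_set)
    while queue:
        account, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand accounts at the depth limit
        for follower in followers_of.get(account, []):
            if follower not in seen:
                seen.add(follower)
                queue.append((follower, depth + 1))
    return seen - seed_set


def in_florida(profile_location: str) -> bool:
    """Crude profile filter (an assumption): match 'Florida' or a ', FL' suffix."""
    text = profile_location.strip().lower()
    return "florida" in text or text.endswith(", fl")


# Hypothetical follower graph and profile locations.
followers_of = {"pr_news": ["u1", "u2"], "u1": ["u3"], "u3": ["u4"]}
profiles = {"u1": "San Juan, PR", "u2": "Miami, FL",
            "u3": "Orlando, Florida", "u4": "Tampa, FL"}

reached = crawl_followers(["pr_news"], followers_of, max_depth=2)
sample = sorted(u for u in reached if in_florida(profiles.get(u, "")))
print(sample)
```

A depth limit is included here only to keep the traversal bounded; free-text profile locations are noisy in practice, so a real filter would need more careful matching than this string test.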

We detected five neighborhoods that could be at risk, which included two of the three red zones announced by the CDC [31]. These neighborhoods are not among those with the highest Puerto Rican connection based on ACS data, which shows the effectiveness of our new method and of social media data in identifying high-risk neighborhoods. Little River was a red zone that our method missed. The Florida Department of Health confirmed to us, after we provided them our results, that this neighborhood did not have significant importation from Puerto Rico.

V CONCLUSION AND FUTURE WORK

In this paper, we focused on predicting the home location of subsets of Twitter users with high resolution and high accuracy. We performed this task using a dynamic structure. A random forest was used to provide better balanced data. We then applied two different deep neural networks, one for prediction and the other for validation. Using DNN-R, we obtained an average accuracy of $81\%$ for the whole population, with the best result being over $83\%$, which is higher than the state-of-the-art methods. More importantly, we obtained up to $92.6\%$ accuracy for a subset of Twitter users by using DNN-C to prune some of the results. This offers a variety of applications the option of obtaining real-time home location data with fine spatial resolution. We then demonstrated the practical effectiveness of our method and of social media data by identifying neighborhoods at high risk of importing Zika from Puerto Rico in 2016. Our approach was much more effective than the conventional approach using ACS data.

One direction for future work is to use this technique in the other applications mentioned earlier. Another direction is to increase the accuracy of the technique. For example, adding features could improve the accuracy of the algorithm. In addition, alternative high-recall algorithms in the first phase could affect the performance of the second phase. We will also explore improved machine learning approaches for the second phase.

Acknowledgement

The authors thank Danielle Stanek and Andrea Morrison from the Florida Department of Health and Kelly Deutsch from the Orange County (Florida) Mosquito Control Division for information on the 2016 Zika outbreak and vector control challenges respectively.

Bibliography

[1] Jones KH, Daniels H, Heys S, Ford DV. Challenges and Potential Opportunities of Mobile Phone Call Detail Records in Health Research: Review. JMIR Mhealth Uhealth, 6:e161, 2018.
[2] Peak CM, Wesolowski A, Erbach-Schoenberg EZ, Tatem AJ, Wetter E, Lu X, Power D, Weidman-Grunewald E, Ramos S, Moritz S, Buckee CO, and Bengtsson L. Population mobility reductions associated with travel restrictions during the Ebola epidemic in Sierra Leone: use of mobile phone data. International Journal of Epidemiology, vol. 47, pp. 1562–1570, 2018.
[3] Wesolowski A, Qureshi T, Boni MF, Sundsoy PR, Johansson MA, Rasheed SB, Engo-Monsen K, and Buckee CO. Impact of human mobility on the emergence of dengue epidemics in Pakistan. Proceedings of the National Academy of Sciences, vol. 112, pp. 11887–11892, 2015.
[4] Mahmud, Jalal, Jeffrey Nichols, and Clemens Drews. "Home location identification of twitter users." ACM Transactions on Intelligent Systems and Technology (TIST) 5.3 (2014): 47.
[5] https://www.omnicoreagency.com/twitter-statistics/ [26/10/18]
[6] Pontes, Tatiana, et al. "Beware of what you share: Inferring home location in social networks." Data Mining Workshops (ICDMW), 2012 IEEE 12th International Conference on. IEEE, 2012.
[7] Cho, Eunjoon, Seth A. Myers, and Jure Leskovec. "Friendship and mobility: user movement in location-based social networks." Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011.
[8] Chen, Chao, Andy Liaw, and Leo Breiman. "Using random forest to learn imbalanced data." University of California, Berkeley 110 (2004): 1-12.