CITY-SCALE HUMAN MOBILITY PREDICTION MODEL BY INTEGRATING GNSS TRAJECTORIES AND SNS DATA USING LONG SHORT-TERM MEMORY

Human mobility analysis on large-scale mobility data has contributed to multiple applications, such as urban and transportation planning, disaster preparation and response, tourism, and public health. However, when unusual events occur, individuals behave differently depending on their personal routines and backgrounds. To improve the accuracy of crowd behavior prediction models, it is important to understand supplemental spatiotemporal topics, such as when and where people observe things and what they are interested in. In this research, we develop a model that integrates social network service (SNS) data into a human mobility prediction model as background information on mobility. We employ multi-modal deep learning models with a long short-term memory (LSTM) architecture to incorporate SNS data into a human mobility prediction model based on Global Navigation Satellite System (GNSS) data. We process anonymized, interpolated GNSS trajectories from mobile phones into mobility sequences of discretized grid IDs, and apply several topic modeling methods to geo-tagged SNS data to extract topic features for each spatiotemporal unit, using the same units as the mobility data. We then integrate the two datasets in multi-modal deep learning models to predict city-scale mobility. The experiment shows that the models with SNS topics perform better than the baseline models.


INTRODUCTION
Today, more than 54 percent of the world's population lives in urban areas (66 percent by 2050)¹. Cities are full of opportunities and services that attract new people. However, cities are also the places with the most serious urban issues, such as problems relating to transportation, public safety, and public health. Emerging cities are facing, and will continue to face, unprecedented problems that need to be solved with new ideas and technologies.
Urban dynamics and large-scale human mobility have been major challenges in urban computing (Zheng et al., 2014). In recent years, analyzing, mining, and visualizing geospatial big data from new sources for decision support has also been considered one of the most important challenges in this era of big data.
The high penetration rate of mobile phones, especially smartphones, enables systematic data collection over longer periods; moreover, users are more willing to connect to multiple services and allow service providers to collect user data (Birenboim and Shoval, 2016). These trends provide an inexpensive means of collecting data on city-scale mobility.
Traditionally, survey-based human mobility data include background information on mobility, such as individuals' trip purposes, home locations, household sizes, and job classifications, which complement mobility models and theories.
Our main idea is to develop a city-scale human mobility prediction model by integrating GNSS trajectories and social network service (SNS) data, in order for the human mobility prediction model to achieve more accurate predictions and to have additional application capabilities.
The key idea of this work is summarized in Figure 1. The key contributions of this paper, which highlight its uniqueness compared to previous research, are as follows:
• Integration of background information for human mobility: the model incorporates spatiotemporal topics from Twitter data as background information on human mobility.
• Efficient multi-source data integration: integrating multi-source data requires a novel prediction model structure that efficiently learns inter-source relationships. Mobile phone GNSS trajectories and geo-tagged SNS data are produced differently, and the resulting difference in their spatiotemporal distributions makes data integration challenging.
The rest of this paper is organized as follows. Section 2 discusses related work. We introduce the data and the preliminaries for the model in Sections 3 and 4, respectively. Section 5 defines our prediction scope, Section 6 experimentally evaluates our methods using real data, and Section 7 concludes the paper.

City-scale human mobility analysis and modeling
Several studies have used location logs and trajectories, ranging from survey-based data (Grinberger and Shoval, 2015; Spaccapietra et al., 2008) to large-scale mobile phone data for city-scale or nation-scale mobility (Calabrese et al., 2013; Kang et al., 2012). Mobility data have supported several significant areas of analysis and data mining. One typical area is urban analysis, in which researchers discover features of cities and evaluate their problems. Studies in this area have usually been motivated by urban planning, for example population estimation (Xu et al., 2016), mobility pattern analysis (Reades et al., 2009), and the discovery of urban functional zones (Yuan et al., 2012).

Human mobility prediction
Mobility prediction, specifically, focuses on providing information for future decision-making. The availability of large-scale human mobility data, such as mobile phone GNSS trajectories, has enabled trajectory mining approaches to human mobility prediction. Song et al. (2010) developed a city-scale human mobility prediction model and showed the high prediction accuracy of human mobility on regular days. More specific prediction schemes, such as predicting disaster evacuation behavior with supplemental disaster information (Song et al., 2014) and estimating taxi travel time (Tang et al., 2016), are also gaining popularity. These specific schemes usually narrow the scope but achieve better accuracy for their particular purposes.

Location based social network (LBSN)
Although less than 1% of tweets are geo-tagged (Morstatter et al., 2013), and mobility data from LBSNs show several biases, such as location bias (sparsity), time bias (more data in the evening), and attribute bias (more young people and wealthy regions) (Mota et al., 2015), Twitter has been one of the most widely used LBSNs².
In addition to the typical application of LBSN data as location data (Sakaki et al., 2013), another application is to use LBSN data as background information. LBSN data potentially contain rich information on people's sentiments, observations, and thoughts. Notable examples include discovering traffic anomalies (Pan et al., 2013), topic association among cities (Liu et al., 2016), and population density estimation combining GNSS point density with topics from tweets (Miyazawa et al., 2019).

Deep Learning on Urban Computing
One of the most significant advancements in machine learning is deep learning, and the urban computing community has adopted it for many applications. For mobility prediction, one of the most important concepts is the recurrent neural network and, especially, its extension, the long short-term memory network (Hochreiter and Schmidhuber, 1997), which enables time series input and output. Significant studies in this area include traffic flow prediction using taxi GNSS trajectories (Niu et al., 2015), anomaly detection of pedestrian accidents (Zhou et al., 2018), and Region of Interest (ROI)-based short-term human mobility prediction (Jiang et al., 2018a). Another important characteristic of deep learning is multimodality, which enables a model to combine data of different modalities by concatenating input vectors or linking network units. Ngiam et al. introduced multimodal deep learning and demonstrated its superiority over unimodal structures in their experiments (Ngiam et al., 2011). Zheng et al. introduced a multi-source model to detect urban anomalies that cannot be found using only one data source; they later elaborated on the concept in a comprehensive review (Zheng, 2015). Multimodal deep learning has gained increasing popularity and interest in the urban computing community (Song et al., 2014).
Compared to the aforementioned related work, we apply feature extraction from LBSN data in a deep learning-based application to improve prediction accuracy while adding analytical capabilities to the prediction results. We also apply the model to a regular day to evaluate its general analytical capability.
² Twitter disabled the "precise location" attached to tweets in 2019.

DATA
3.1 City-scale interpolated and anonymized mobile phone GNSS trajectory
The GNSS trajectories used in this study are from anonymized mobile phone users throughout the Greater Tokyo Area from July 1, 2012 to July 31, 2012, and were processed by NTT DOCOMO, INC. NTT DOCOMO, INC. collected an anonymous GNSS log dataset, "Konzatsu-Tokei (R)" Data, in Japan over a three-year period (August 1, 2010 to July 31, 2013). "Konzatsu-Tokei (R)" Data refers to data on the flow of people, collected from individual location data sent from mobile phones with the users' consent through applications³ provided by NTT DOCOMO, INC. These data are processed collectively and statistically to conceal private information. The original location data are GNSS data (latitude, longitude) sent at a minimum recurring period of approximately five minutes, and they do not include information on individuals. One important drawback of the data is that the number of points is biased toward daytime (when people are awake and moving) and downtown areas (with greater population and more mobility-related activity).
Additionally, each trajectory is map-matched to the transportation network and spatiotemporally interpolated to a one-minute interval, so that each point lies on a transportation network link or node. Each point also has an estimated transportation mode label (Table 1). This interpolation improved the human mobility prediction results in previous work (Song et al., 2016).

Table 1: Estimated mobility mode labels from a previous study (Song et al., 2016).
Label   Mode
1       Stay
2       Walk
3       Bicycle
4       Car
5       Train
6       Unknown

Geo-tagged tweets
The geo-tagged tweets from July 1, 2012 to July 31, 2012 were collected using the Twitter API. To extract tweets concerning mobility and social activity, only tweets posted from check-in services (e.g., Swarm, an LBSN service by Foursquare Labs Inc.) are used in the following experiments.

Preprocessing text
Preprocessing is essential in natural language processing, and we follow our previous work (Miyazawa et al., 2019). After text cleaning to remove noise and irrelevant text, we conduct word segmentation and normalization for part-of-speech (POS) tagging, and finally we remove "stop words," such as Japanese particles, auxiliary verbs, and pronouns, to optimize the learning process.
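The cleaning and POS-based stop-word filtering steps can be sketched as follows. This is a minimal illustration, not the authors' pipeline: Japanese word segmentation and POS tagging are assumed to be done beforehand by an external morphological analyzer, and the tag names in `STOP_POS` are hypothetical placeholders that would need to be mapped to that analyzer's actual tag set.

```python
import re

# Hypothetical POS tag names; a real Japanese analyzer uses its own tag set.
STOP_POS = {"particle", "auxiliary_verb", "pronoun"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def clean_text(text: str) -> str:
    """Remove URLs, @usernames and #hashtags, then collapse whitespace."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


def remove_stop_words(tokens):
    """Keep surface forms whose POS tag is not a stop-word category.

    `tokens` is a list of (surface, pos) pairs, assumed to come from an
    external word segmenter / POS tagger.
    """
    return [surface for surface, pos in tokens if pos not in STOP_POS]
```

For example, `clean_text("I'm at Shibuya Station @friend #tokyo http://4sq.com/xyz")` keeps only the venue phrase, and filtering `[("渋谷", "noun"), ("に", "particle")]` keeps only `"渋谷"`.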

Human mobility trajectory
As human mobility data, this study uses GNSS trajectories from mobile phones. The raw GNSS trajectory is structured as a set of 4-tuples:
$$ \{(user,\ timestamp,\ lat,\ lon)\} \qquad (1) $$
where user, timestamp, lat, and lon are the user ID, timestamp, latitude, and longitude, respectively. Typically, GNSS modules in mobile phones create records at a certain time interval, which varies depending on the purpose of the application. For example, navigation applications measure and record the location every few seconds, as they require frequent location measurements for accurate and precise navigation. In contrast, personal logging applications measure and record the location every few minutes, and sometimes record it only when movement is detected in order to save battery. Additionally, most applications can only record locations when a GNSS or cell signal is available. Therefore, the timestamps in raw GNSS trajectories usually have no stable interval. If people are indoors or underground, there will be gaps in the individual's GNSS trajectory.
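To illustrate why unstable sampling intervals call for resampling, the sketch below converts irregular (timestamp, lat, lon) fixes into a fixed-interval series by straight-line interpolation in latitude and longitude. This is a deliberately naive stand-in and is not the map-matched interpolation of Song et al. (2016) used in this study, which places points on transportation network links; the function name and the 60-second default are assumptions for illustration.

```python
from bisect import bisect_right


def resample_linear(points, step_s=60):
    """Resample time-sorted (timestamp_s, lat, lon) fixes at a fixed step.

    Naive straight-line interpolation between consecutive fixes; it ignores
    the road/rail network, unlike map-matched interpolation.
    """
    if not points:
        return []
    times = [p[0] for p in points]
    out = []
    t = times[0]
    while t <= times[-1]:
        j = bisect_right(times, t)  # index of the first fix strictly after t
        if j == len(times):  # t equals the last timestamp
            out.append((t, points[-1][1], points[-1][2]))
        else:
            t0, lat0, lon0 = points[j - 1]
            t1, lat1, lon1 = points[j]
            a = (t - t0) / (t1 - t0)  # fraction of the way from fix j-1 to fix j
            out.append((t, lat0 + a * (lat1 - lat0), lon0 + a * (lon1 - lon0)))
        t += step_s
    return out
```

Two fixes two minutes apart, resampled at 60 s, yield three points, with the middle one halfway between them.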
The interpolated GNSS trajectory (Song et al., 2016) that we mainly use in this study is structured as a set of 5-tuples:
$$ \{(user,\ timestamp,\ lat,\ lon,\ mode)\} \qquad (2) $$
where user, timestamp, lat, lon, and mode are the user ID, timestamp, latitude, longitude, and travel mode label (Table 1), respectively. The location (latitude, longitude) is then encoded as a spatial index l according to the third-level grid square code of Japanese Industrial Standards (JIS) X0410. The third-level JIS X0410 code consists of 8 numeric digits, and each grid cell spans 30 arc-seconds in latitude and 45 arc-seconds in longitude, which corresponds to approximately 1 km in the study area.
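The grid encoding can be made concrete with a short function. The sketch below computes the standard third-level (about 1 km) JIS X0410 code, which has 8 digits: 4 digits for the first-level square (40′ of latitude by 1° of longitude), 2 for the second level (5′ by 7′30″), and 2 for the third level (30″ by 45″). The function name is hypothetical, and points exactly on a cell boundary may be affected by floating-point rounding.

```python
import math


def jis_mesh3(lat: float, lon: float) -> str:
    """Third-level JIS X0410 grid square code (8 digits) for a point in Japan."""
    lat_min = lat * 60.0                 # latitude in arc-minutes
    p = math.floor(lat_min / 40)         # 1st level: 40' of latitude
    a = lat_min - p * 40
    q = math.floor(a / 5)                # 2nd level: 5' of latitude
    b = a - q * 5
    r = math.floor(b * 60 / 30)          # 3rd level: 30" of latitude

    u = math.floor(lon) - 100            # 1st level: 1 degree of longitude
    f_min = (lon - math.floor(lon)) * 60.0
    v = math.floor(f_min / 7.5)          # 2nd level: 7'30" of longitude
    g = f_min - v * 7.5
    w = math.floor(g * 60 / 45)          # 3rd level: 45" of longitude

    return f"{p:02d}{u:02d}{q}{v}{r}{w}"
```

For example, a point near Tokyo Station (35.681236, 139.767125) falls in cell `53394611`.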

Modeling human mobility sequence
Let $U = \{u_1, u_2, \ldots\}$ be a set of users and $T = \{t_1, t_2, \ldots\}$ be a set of temporal indexes with a constant time interval $\Delta t$. To model human mobility data for the prediction model, we process the dataset to produce a human mobility sequence for each user $u$:
$$ hms_u = (r_{t_1}, r_{t_2}, \ldots, r_{t_n}) \qquad (3) $$
where each element $r_t$ holds the location denotation and supplemental labels for temporal index $t$. Depending on the input for the prediction model, $r$ is structured as one of the following (4, 5, 6):
$$ r = \{(l)\} \qquad (4) $$
$$ r_{mode} = \{(l,\ mode)\} \qquad (5) $$
$$ r_{mode,topic} = \{(l,\ mode,\ topic)\} \qquad (6) $$
where (4) is for the model with only the location code $l$, (5) is for the model with the location code $l$ and the transportation mode code $mode$, and (6) is for the model with the location code $l$, the transportation mode code $mode$, and the topic feature $topic$.
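A minimal sketch of turning one user's minute-level records into the sequence elements of (4), (5), or (6) is shown below. The input layout, the function name `build_sequence`, the default $\Delta t$ of 10 minutes, and the encoding of the topic feature as a single integer looked up by (grid code, hour) are all illustrative assumptions, not details given in the paper; `-1` marks cells with no tweets.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[int, str, int]  # (minute index, grid code l, mode label)


def build_sequence(records: List[Record], dt: int = 10, use_mode: bool = False,
                   topics: Optional[Dict[Tuple[str, int], int]] = None) -> list:
    """Sample one element r_t every `dt` minutes from minute-level records.

    Returns tuples (l,), (l, mode) or (l, mode, topic). `topics` is a
    hypothetical lookup from (grid code, hour index) to a topic feature.
    """
    seq = []
    for minute, l, mode in sorted(records):
        if minute % dt != 0:
            continue  # keep only the constant-interval temporal indexes
        if topics is not None:
            seq.append((l, mode, topics.get((l, minute // 60), -1)))
        elif use_mode:
            seq.append((l, mode))
        else:
            seq.append((l,))
    return seq
```

The same records can thus feed all three model variants, which only differ in how many features each time step carries.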

Modeling topic from SNS data
We use Twitter data, discover topics from it, and integrate them into the model as background information on mobility. The tweet dataset is structured as a table containing the user ID, latitude, longitude, timestamp, and raw tweet. Supplemental components of the original tweets, such as URLs, hashtags, usernames, and location names in "check-in" tweets, are removed.
For further processing, we define a tweet and its text as follows:
$$ tw = (user,\ timestamp,\ lat,\ lon,\ w) \qquad (7) $$
$$ w = \{w_1, w_2, \ldots, w_{N_i}\} \qquad (8) $$
A tweet $tw$ is a 5-tuple, where user, timestamp, lat, and lon correspond to the user ID, timestamp, latitude, and longitude of the tweet, respectively, and $w$ is a bag-of-words containing $N_i$ words. Let the vocabulary $V = \{1, 2, \cdots\}$ be the set of word IDs, such that every word appearing at least once in the tweet collection has an ID in $V$.
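The vocabulary and bag-of-words definitions above can be sketched in a few lines. Assigning IDs in order of first appearance, starting from 1 to match $V = \{1, 2, \cdots\}$, is an assumption for illustration; any consistent assignment would do.

```python
from collections import Counter
from typing import Dict, List


def build_vocabulary(docs: List[List[str]]) -> Dict[str, int]:
    """Assign word IDs 1, 2, ... to every word in order of first appearance."""
    vocab: Dict[str, int] = {}
    for words in docs:
        for word in words:
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def bag_of_words(words: List[str], vocab: Dict[str, int]) -> Dict[int, int]:
    """Map a tweet's word list w to {word ID: count}, ignoring unknown words."""
    return dict(Counter(vocab[w] for w in words if w in vocab))
```

Stacking these sparse counts column by column gives the term-document matrix used by the topic models.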

Topic Modeling from Twitter data
Several topic models from our previous work (Miyazawa et al., 2019) are applied to the geo-tagged tweets.
The first model is latent semantic analysis (LSA), based on an "online incremental streamed distributed training algorithm" (Řehůřek, 2011). LSA takes the term frequency-inverse document frequency (TF-IDF) matrix as input and computes a low-rank approximation of it using singular value decomposition:
$$ X_{numterm \times numdoc} \approx U_{numterm \times k}\, \Sigma_{k \times k}\, V^{T}_{k \times numdoc} \qquad (9) $$
where X is the term-document matrix, U and V have orthonormal columns, and Σ is a diagonal matrix; numterm is the number of terms, numdoc is the number of documents, and k is the dimension size (the number of topics). The ith column of X represents the ith document in terms of each term, while the ith column of $V^T$, denoted $d_i$, represents the ith document in the k-dimensional topic space. Finally, $\Sigma_k d_i$ for each document is saved and used with the regression models.
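The following sketch shows the LSA computation on a small dense matrix with NumPy. It uses a full in-memory SVD rather than the online, streamed algorithm of Řehůřek (2011) used in the study, and the TF-IDF weighting shown is only one common variant; both are simplifying assumptions. Note that $\Sigma_k d_i$ equals $U_k^T$ applied to the ith column of X, which the test below checks.

```python
import numpy as np


def tfidf(counts: np.ndarray) -> np.ndarray:
    """TF-IDF weighting of a term-document count matrix (terms x documents)."""
    n_docs = counts.shape[1]
    df = np.count_nonzero(counts, axis=1)          # document frequency per term
    idf = np.log(n_docs / np.maximum(df, 1))
    return counts * idf[:, None]


def lsa_document_vectors(X: np.ndarray, k: int) -> np.ndarray:
    """Return Sigma_k d_i for every document i, as rows of a (documents x k) array."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Column i of Vt[:k] is d_i; scaling rows by the top-k singular values gives Sigma_k d_i.
    return (s[:k, None] * Vt[:k]).T
```

For corpora of millions of tweets, an incremental or randomized SVD would replace `np.linalg.svd`.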
Secondly, latent Dirichlet allocation (LDA) is performed. LDA is a probabilistic extension of LSA (Blei et al., 2003). LDA assumes that a set of documents is derived from K topics through a generative process in which each topic k has a multinomial distribution over the vocabulary, $\beta_k \sim \mathrm{Dirichlet}(\eta)$. For each document d, a distribution over topics $\theta_d \sim \mathrm{Dirichlet}(\alpha)$ is drawn; then, for each word position i, a topic index $z_{di} \in \{1, 2, \cdots, K\}$ is drawn as $z_{di} \sim \theta_d$, and finally the word is drawn from the selected topic, $w_{di} \sim \beta_{z_{di}}$. In this study, a variant with a faster online implementation (Hoffman et al., 2010) is used. The topic probabilities for each document are saved and used with the regression models.
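To make the generative assumptions explicit, the sketch below samples a synthetic corpus from the LDA generative process with NumPy. It only illustrates the model; it does not perform inference, which in the study is done with online variational Bayes (Hoffman et al., 2010). Topic indexes are 0-based here, and the hyperparameter values are arbitrary assumptions.

```python
import numpy as np


def sample_lda_corpus(n_docs, doc_len, vocab_size, k, alpha=0.5, eta=0.1, seed=0):
    """Draw a synthetic corpus from the LDA generative process."""
    rng = np.random.default_rng(seed)
    beta = rng.dirichlet(np.full(vocab_size, eta), size=k)      # beta_k ~ Dir(eta)
    docs, thetas = [], []
    for _ in range(n_docs):
        theta = rng.dirichlet(np.full(k, alpha))                # theta_d ~ Dir(alpha)
        z = rng.choice(k, size=doc_len, p=theta)                # z_di ~ Cat(theta_d)
        words = [int(rng.choice(vocab_size, p=beta[t])) for t in z]  # w_di ~ Cat(beta_z)
        docs.append(words)
        thetas.append(theta)
    return docs, np.array(thetas), beta
```

Fitting LDA inverts this process: given only `docs`, it estimates `thetas` (the per-document topic probabilities that are saved) and `beta`.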
Finally, the topic tensor $\mathcal{T} = \{z_{k,l,t}\}$ is defined using the results of the topic models. It contains the topic weight z of topic k for the collection of tweets falling under spatial index l and temporal index t. Based on experiments, we set the spatial index to the same grid used for the human mobility trajectory data, the temporal index to one hour, and the number of topics k to 10.
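The paper does not state how per-tweet topic weights are combined within a spatiotemporal unit; the sketch below assumes simple averaging of the per-tweet topic distributions in each (grid l, hour t) cell, and stores the sparse tensor as a dictionary. Both the aggregation rule and the data layout are assumptions for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def build_topic_tensor(tweets: Iterable[Tuple[str, int, List[float]]]
                       ) -> Dict[Tuple[str, int], List[float]]:
    """Average per-tweet topic distributions within each (grid l, hour t) cell.

    Each input item is (grid code, hour index, topic distribution of length k).
    """
    sums: Dict[Tuple[str, int], List[float]] = {}
    counts: Dict[Tuple[str, int], int] = {}
    for l, t, dist in tweets:
        key = (l, t)
        if key not in sums:
            sums[key] = [0.0] * len(dist)
            counts[key] = 0
        sums[key] = [s + p for s, p in zip(sums[key], dist)]
        counts[key] += 1
    return {key: [s / counts[key] for s in vec] for key, vec in sums.items()}
```

Cells without any tweets are simply absent, so a downstream lookup needs a default value for them.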

PREDICTION MODEL
Given a subset of the human mobility sequence $hms_{u,t} = (r_{t_1}, r_{t_2}, \ldots, r_t)$, our prediction model $P_\theta(\widehat{hms}_{u,t+1} \mid hms_{u,t})$ predicts the human mobility sequence at the next time step, $hms_{u,t+1} = r_{t+1}$. Here θ is the set of model parameters, obtained by minimizing the model loss:
$$ \theta^{*} = \arg\min_{\theta} \mathcal{L}(\widehat{hms}_{u,t+1},\ hms_{u,t+1}) \qquad (10) $$
By applying the prediction model autoregressively, we predict the human mobility sequence for several steps. As the location is encoded as a spatial index, each spatial index label is converted to a one-hot vector, and the model minimizes the categorical cross entropy between the predicted distribution $\hat{y}_l$ and the true distribution $y_l$. Therefore, (10) can be further expanded as:
$$ \theta^{*} = \arg\min_{\theta} \left( -\sum_{l} y_l \log \hat{y}_l \right) \qquad (11) $$
where the sum runs over all spatial indices l.
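Two pieces of this section are easy to misread, so here is a small sketch of each. With a one-hot target, the cross entropy in (11) reduces to the negative log probability of the true grid cell. The autoregressive rollout feeds each prediction back as input; choosing the arg-max location at each step is an assumption, since the paper does not say whether predictions are taken greedily or sampled. `step` stands for any trained one-step model and is hypothetical.

```python
import math
from typing import Callable, List, Sequence


def categorical_cross_entropy(y_hat: Sequence[float], true_index: int,
                              eps: float = 1e-12) -> float:
    """Cross entropy for a one-hot target: only the true class term survives."""
    return -math.log(max(y_hat[true_index], eps))


def rollout(history: List[int], step: Callable[[List[int]], Sequence[float]],
            n_steps: int) -> List[int]:
    """Autoregressive prediction: append the arg-max location and feed it back."""
    seq = list(history)
    for _ in range(n_steps):
        probs = step(seq)
        seq.append(max(range(len(probs)), key=probs.__getitem__))
    return seq[len(history):]
```

Because each step consumes earlier predictions, an early mistake can propagate through the remaining steps of the rollout.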

Model architecture
The model architecture is shown in Fig. 2. It consists of embedding layers, multiple long short-term memory (LSTM) layers, and one output activation layer. For the output activation, we use softmax in this study.

Trajectory embedding
The original location denotation in the sequence is the region ID. Because the IDs are arbitrarily assigned to grid cells, the spatial relationships among grid cells must be learned through trajectory embedding; this idea is similar to Word2Vec word embedding (Mikolov et al., 2013). Embedding also reduces the dimension of the input vector, thereby reducing memory and computational cost.
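The dimension-reduction argument can be checked directly: multiplying a one-hot grid vector by an embedding matrix selects one row of that matrix, so an embedding layer replaces a large sparse product with a cheap lookup into a small dense vector. The sizes below (5,000 grid cells, 64 dimensions) are hypothetical, and the matrix is randomly initialized rather than trained.

```python
import numpy as np

# Hypothetical sizes: number of grid cells and embedding width.
N_GRIDS, EMB_DIM = 5000, 64

rng = np.random.default_rng(0)
E = rng.normal(size=(N_GRIDS, EMB_DIM))  # learnable embedding matrix (random init here)


def one_hot(index: int, size: int) -> np.ndarray:
    """Return a one-hot vector of length `size` with a 1 at `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v


def embed_via_matmul(grid_id: int) -> np.ndarray:
    """Dense product of a 5000-dim one-hot vector with E."""
    return one_hot(grid_id, N_GRIDS) @ E


def embed_via_lookup(grid_id: int) -> np.ndarray:
    """Equivalent row lookup, which is what an embedding layer does."""
    return E[grid_id]
```

During training, gradients only update the rows of `E` for grid cells that actually appear, which is how trajectories shape the learned spatial relationships.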

Multi-source data integration
The mobility input, topic input, and travel mode input each have an individual embedding layer. The outputs of the embedding layers are concatenated and fed into one shared hidden layer. This multi-modal network structure enables the model to learn from multi-source data concurrently.
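A minimal sketch of the per-time-step concatenation follows. The embedding widths are hypothetical, the mode labels of Table 1 are shifted to 0-based indexes, and the topic feature is assumed to be a single topic index (the paper does not specify whether the 10-dimensional topic vector or a topic index is embedded). The concatenated rows form the input sequence of the shared LSTM layer.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical vocabulary sizes and embedding widths (random init, not trained).
E_grid = rng.normal(size=(5000, 64))   # grid cells
E_mode = rng.normal(size=(6, 4))       # six mode labels (Table 1), 0-based here
E_topic = rng.normal(size=(10, 8))     # k = 10 topics


def embed_step(grid_id: int, mode_id: int, topic_id: int) -> np.ndarray:
    """Concatenate the three per-source embeddings for one time step."""
    return np.concatenate([E_grid[grid_id], E_mode[mode_id], E_topic[topic_id]])


def embed_sequence(seq) -> np.ndarray:
    """Stack per-step vectors into a (time steps x 76) input for the shared layer."""
    return np.stack([embed_step(*r) for r in seq])
```

Keeping separate embedding tables lets each source have its own width, while concatenation lets the shared layer learn interactions between them.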

LSTM network
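The shared layer uses the long short-term memory (LSTM) unit (Hochreiter and Schmidhuber, 1997). An LSTM network is an extension of the recurrent neural network (RNN). It takes a sequence $X = x_1, x_2, \ldots, x_T$ as input and produces another sequence $H = h_1, h_2, \ldots, h_T$ as output, computing each network unit with the following equations:
$$ i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \qquad (12) $$
$$ f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \qquad (13) $$
$$ \tilde{C}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) \qquad (14) $$
$$ C_t = i_t \odot \tilde{C}_t + f_t \odot C_{t-1} \qquad (15) $$
$$ o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \qquad (16) $$
$$ h_t = o_t \odot \tanh(C_t) \qquad (17) $$
$$ y_t = \phi(W_y h_t + b_y) \qquad (18) $$
where $i_t$, $f_t$, and $o_t$ are the input gate, forget gate, and output gate functions at time t, respectively; $\tilde{C}_t$ is the candidate value and $C_t$ is the new value of the memory cell states at time t. $x_t$ is the input to the memory cell layer and $h_t$ is the representation layer at time t. $W_i, W_c, W_f, W_o, W_y, U_i, U_c, U_f$, and $U_o$ are weight matrices, and $b_i, b_c, b_f, b_o$, and $b_y$ are bias vectors. σ is the logistic sigmoid, and φ is the network output activation function; softmax is used in this study. Including the forget gate function is an attempt to mitigate the vanishing gradient problem.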
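The gate equations can be written as a single NumPy step function, shown below as a minimal, untrained sketch of the standard LSTM cell with a softmax output. Passing the weights as dictionaries keyed by gate name is an illustrative choice, not the structure of any particular deep learning library.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b are dicts keyed by gate name "i", "f", "c", "o"."""
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])      # input gate
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])      # forget gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate values
    c_t = i_t * c_tilde + f_t * c_prev                          # new cell state
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])      # output gate
    h_t = o_t * np.tanh(c_t)                                    # hidden representation
    return h_t, c_t


def softmax(z):
    """Output activation phi: a probability distribution over grid cells."""
    e = np.exp(z - z.max())
    return e / e.sum()
```

With all weights and biases at zero, every gate equals 0.5, so the cell keeps half of its previous state; this makes the role of the forget gate easy to see.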

Prediction evaluation scheme
The prediction model is trained to minimize the model loss (categorical cross entropy) defined in (11). Models with different parameter settings and the baseline models are then evaluated based on categorical accuracy. Of the input data, 80% is used for training and the remaining 20% for evaluation.
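The evaluation protocol can be sketched as a shuffled 80/20 split and a categorical accuracy function, the fraction of samples whose highest-probability grid cell matches the true one. Whether the paper splits by sample or by user is not stated; the shuffled per-sample split and the fixed seed here are assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_test_split(samples: List[T], train_frac: float = 0.8,
                     seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the samples and split them into train / evaluation parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


def categorical_accuracy(pred_probs: Sequence[Sequence[float]],
                         true_idx: Sequence[int]) -> float:
    """Fraction of samples whose arg-max prediction equals the true class index."""
    hits = sum(max(range(len(p)), key=p.__getitem__) == y
               for p, y in zip(pred_probs, true_idx))
    return hits / len(true_idx)
```

Splitting by user instead would test generalization to unseen people, which is a stricter setting.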

Result of Topic modeling
To train the topic models, we selected the Greater Tokyo Area (138.72°E to 140.87°E in longitude, 34.9°N to 36.28°N in latitude) as the area of interest, and the time period from July 25, 2013 to July 31, 2013. The models were then applied to tweets from different time spans to produce the topic tensor, as discussed in the following section. Based on experiments, we used the topic tensor from LDA in the prediction model. Table 2 shows some of the topics that are interpretable as distinct topics. Some topics can be considered strongly related to mobility and transportation infrastructure (topic 2), while others are related to news reporting (topic 7) or seasonal events (topic 10).
Table 2: Topics and words. The words with high association to each topic are listed with translations. The authors interpreted and labeled each topic based on the words.

Prediction scenario and parameter settings
To evaluate the models' performance, we set a prediction scenario and parameter settings (Table 3). The period from 8 AM to 9 AM on Thursday, July 26, 2012 was chosen as the target prediction span. Because this span falls within the morning commute, we expected the majority of the mobility to be routine commutes, which have been a challenge for mobility prediction models in previous studies (Jiang et al., 2018b). In addition to the deep learning models, we added two baseline models: an N-gram-like model and a simple recurrent neural network (RNN) model. The N-gram-like model, an approach typically used in natural language processing, is trained on consecutive sequences of N and N-1 grid IDs; for each sequence of N-1 consecutive grid IDs, it computes the probability distribution of the next step. Here, we adopt a four-gram-based model (Jiang et al., 2018b), which uses three steps to predict the next time step. The simple RNN model consists of a traditional fully connected RNN structure.
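The four-gram baseline described above can be sketched as a count table from each three-step context to the grid IDs that followed it. Returning the most frequent next grid ID, and `None` for unseen contexts, are simplifications assumed here; dividing each counter by its total gives the probability distribution mentioned in the text.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Optional, Sequence, Tuple


def fit_ngram(sequences: Sequence[Sequence[Hashable]], n: int = 4) -> Dict[Tuple, Counter]:
    """Count the next grid ID observed after every (n-1)-step context."""
    table: Dict[Tuple, Counter] = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            context = tuple(seq[i:i + n - 1])
            table[context][seq[i + n - 1]] += 1
    return table


def predict_next(table: Dict[Tuple, Counter], context: Sequence[Hashable]) -> Optional[Hashable]:
    """Most frequent next grid ID for the context, or None if it was never seen."""
    counts = table.get(tuple(context))
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

Unseen contexts are common for rare trajectories, which is one reason such baselines struggle outside frequently traveled routes.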
6.4 Performance evaluation
6.4.1 Choosing input data and network structure
Table 4 compares the accuracy of the deep learning models under different parameter selections. The model with two LSTM layers and the embedding layer combining grid (location), transportation mode, and topic features achieved the best accuracy among all structures. The main model is an LSTM with tanh as the output activation layer and dropout. The other variations (LSTM-without-dropout, LSTM-softsign, and LSTM-adam) performed worse than the main model. While adding only the mode feature improved the accuracy, adding only the topic feature did not. However, the combination of the three features improved the accuracy the most.
6.4.2 Computational cost, training data size and learning efficiency
To conduct the experiment quickly, we only selected experimental settings that required less than 24 hours for the training phase. Our primary computation machine had eight CPU cores and one NVIDIA TITAN X GPU, and we confirmed that the model was not overfitting to the training data (Figure 3). As Table 5 shows, increasing the number of training samples improves the prediction accuracy, up to 78.5%.
… truth (red). Whereas the individual prediction accuracy measures accuracy for each user, this figure shows macro-scale accuracy, i.e., whether the aggregated results of the model reflect real-world human mobility. Overall, the predicted result reflects the spatial distribution of human mobility; however, the spatial discrepancy is largest in rural areas, most likely because of the lack of training data. Figure 5 shows the absolute percentage error in the number of users (predicted versus ground truth) to evaluate the quantitative error in each cell. The spatial distribution of the error is fairly consistent: the cells with significant error (> 150%) are located not in the center of Tokyo but along transportation infrastructure.
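The per-cell error measure can be sketched as follows: for each grid cell, the absolute percentage error (APE) is |predicted − true| / true × 100, and cells above a threshold (150% in the text) are flagged. Cells with zero ground-truth users are skipped because APE is undefined there; that handling, and the dictionary layout, are assumptions for illustration.

```python
def absolute_percentage_error(pred: dict, truth: dict) -> dict:
    """APE (%) per grid cell: |pred - truth| / truth * 100, for cells with truth > 0."""
    return {cell: abs(pred.get(cell, 0) - n) / n * 100.0
            for cell, n in truth.items() if n > 0}


def high_error_cells(ape: dict, threshold: float = 150.0) -> list:
    """Grid cells whose APE exceeds the threshold, sorted for stable output."""
    return sorted(cell for cell, e in ape.items() if e > threshold)
```

Because APE divides by the true count, sparsely populated cells can show very large percentages from small absolute differences.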
6.6 Prediction result and share of each topic
Figure 6 shows the prediction result for a sample trajectory. While the model inaccurately predicts an intermediate vertex, it predicts the destination accurately. By extracting the spatiotemporal topic feature, each trajectory can be analyzed individually based on which topics are significant at the moment of prediction.

CONCLUSION
In this study, we developed a multi-modal human mobility prediction model using LSTM that combines GNSS trajectories and SNS data. The experiment demonstrated that the model successfully combined mobility and topic features for the prediction scenario and performed better than typical baseline models. While limited access to the input data restricts the applicability of this study, the model structure can be introduced to other related applications.

Figure 6: The model produces a small error during the autoregressive process, but predicts the correct destination one hour later. The stacked chart shows the share of each topic by temporal index (every 10 minutes). Topic 9 (commute/event) gains share in the prediction time span, while topic 5 (daily life) is significant throughout the whole time span.