FAST SS-ILM: A COMPUTATIONALLY EFFICIENT ALGORITHM TO DISCOVER SOCIALLY IMPORTANT LOCATIONS

Socially important locations are places that social media users visit frequently during their social media lifetime. Discovering socially important locations provides valuable information about user behaviour on social media networking sites. However, discovering them is challenging due to data volume and dimensionality, spatial and temporal calculations, location sparseness in social media datasets, and the inefficiency of current algorithms. Several studies in the literature discover important locations; however, the proposed approaches are not computationally efficient. In this study, we propose the Fast SS-ILM algorithm, a modification of the SS-ILM algorithm, to mine socially important locations efficiently. Experimental results show that the proposed Fast SS-ILM algorithm decreases the execution time of the socially important locations discovery process by up to 20%.


INTRODUCTION
Socially important locations are places that social media users visit frequently during their social media lifetime (Dokuz and Celik, 2017). Socially important locations mining aims to discover the important locations of social media users from their spatial social media histories. Discovering socially important locations reveals much information about the spatial preferences of social media users, such as which locations are important for a user, which locations co-occur among users, which users periodically visit a location, and which locations are common to a social media user group.
Discovering socially important locations from social media datasets is challenging due to data volume and dimensionality, the large number of spatial and temporal calculations, location sparseness in social media datasets, and the computational complexity of current algorithms.
In the literature, several methods and algorithms have been proposed to discover important locations. However, these studies have several limitations. Many of them use GPS or other data sources, so their methods and algorithms are tailored to those data sources. In addition, most of the studies do not aim to decrease the computational complexity of the algorithms.
Social media datasets are very sparse (Bao et al., 2015). In these datasets, some places are visited frequently while many places are visited rarely (e.g., only once). For example, Figure 1 shows the locations visited by a social media user during her/his social media history, and Table 1 gives some statistical information about the user. As can be seen in the figure, the visited locations are distributed among several cities. Many of them are in Istanbul, Turkey; however, the locations inside Istanbul are also distributed among different regions. As can be seen in the table, the user visited 95 distinct locations, and 65 of these, roughly two thirds, were visited only once. Such locations cannot be socially important locations of the user, and they should be eliminated before the socially important locations discovery process.
Table 1. Statistical information of the social media user
In this study, the Socio-Spatially Important Locations Mining (SS-ILM) algorithm proposed by Dokuz and Celik (2017) is modified to decrease its execution time for faster discovery of socially important locations. The proposed Fast SS-ILM algorithm prunes the rarely visited locations of social media users as early as possible, so unnecessary spatial calculations for non-frequent locations are avoided. With this strategy, the execution time of the Fast SS-ILM algorithm decreases by up to 20% with respect to the SS-ILM algorithm.
The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 introduces the basic concepts of socially important locations mining and the SS-ILM algorithm, and presents the proposed Fast SS-ILM algorithm. Section 4 experimentally evaluates the Fast SS-ILM algorithm. Section 5 presents conclusions and future work.

RELATED WORK
Spatial data mining gained considerable attention in the social media networking domain after social media networking sites started collecting spatial data about their users (Bao et al., 2015; Kefalas et al., 2016). Spatial co-location mining (Celik, 2015; Celik et al., 2008; Yu, 2016) and spatial clustering (Hu and Sung, 2005; Tung et al., 2001) are the main topics of spatial data mining that could be applied to social media datasets. In social group-level socially important locations mining, Zheng et al. (2009) proposed approaches and models to discover the top n interesting locations and the top m classical travel sequences by considering users' experiences and the relationships between users. Khetarpaul et al. (2011) proposed relational algebra based operations to discover interesting locations by analyzing the trajectories of multiple users. Dokuz and Celik (2017) proposed an approach to discover socio-spatially important locations from a group of social media users. Their approach can also discover user-level socially important locations. Although these studies propose novel methods to discover social group-based socially important locations, they do not focus on developing computationally efficient algorithms.
In this study, we propose the Fast SS-ILM algorithm, a modification of the SS-ILM algorithm (Dokuz and Celik, 2017), to mine socially important locations efficiently. The proposed Fast SS-ILM algorithm prunes non-frequent locations as early as possible, and thus the execution time of the algorithm decreases.

BASIC CONCEPTS AND MODELLING FAST SS-ILM ALGORITHM
In this section, the basic concepts of socially important locations mining are given first, and then the SS-ILM algorithm (Dokuz and Celik, 2017) is introduced. Finally, the idea behind the Fast SS-ILM algorithm is presented.

Basic Concepts
Discovering the socially important locations of a social media user group consists of two parts: user-level socially important locations discovery and social group-level socially important locations discovery (Dokuz and Celik, 2017). For user-level discovery, the interest measures of location density and visit lifetime are used, and for social group-level discovery, the interest measure of user prevalence is used, as discussed in Dokuz and Celik (2017). These interest measures are defined as follows.
Definition 3.1.1. (Location Density): Location density is the ratio of the number of occurrences of the user at a given location to the total number of occurrences of the user (Dokuz and Celik, 2017).
Definition 3.1.2. (Visit Lifetime): Visit lifetime is the ratio of the time span between the user's first and last visits to the location to the time span between the user's first and last occurrences in the social media history (Dokuz and Celik, 2017).
After each interest measure is calculated, the locations are checked against the user-given min_density and min_visit thresholds. The locations that satisfy both thresholds are selected as socially important locations for the user (SILU).
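The two user-level measures and the threshold check can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of Dokuz and Celik (2017): the function names, the representation of a user's history as (timestamp, location label) pairs, and the reading of "satisfies the threshold" as greater than or equal to are all hypothetical choices made here.

```python
from collections import defaultdict
from datetime import datetime


def user_level_measures(visits):
    """Return {location: (location_density, visit_lifetime)} for one user.

    `visits` is a list of (timestamp, location_label) pairs; this layout
    is an assumption made for illustration.
    """
    if not visits:
        return {}
    times_by_location = defaultdict(list)
    for timestamp, location in visits:
        times_by_location[location].append(timestamp)

    total = len(visits)
    all_times = [timestamp for timestamp, _ in visits]
    history_span = (max(all_times) - min(all_times)).total_seconds()

    measures = {}
    for location, times in times_by_location.items():
        # Location density: occurrences at this location / all occurrences.
        density = len(times) / total
        # Visit lifetime: span of visits to this location / span of history.
        visit_span = (max(times) - min(times)).total_seconds()
        # Guard against a history whose span is zero.
        lifetime = visit_span / history_span if history_span > 0 else 0.0
        measures[location] = (density, lifetime)
    return measures


def discover_silu(visits, min_density, min_visit):
    """Locations meeting both thresholds (assumed here to mean >=)."""
    return {
        location
        for location, (density, lifetime) in user_level_measures(visits).items()
        if density >= min_density and lifetime >= min_visit
    }
```

Note that a location visited only once always has a visit lifetime of zero in this sketch, so it can never pass a positive min_visit threshold.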
Definition 3.1.3. (User Prevalence): User prevalence is the ratio of the number of social media users who have location l as a socially important location for user (SILU) to the total number of users (Dokuz and Celik, 2017).
After the user prevalence value is calculated, the locations are checked against the min_UP threshold. The locations that satisfy the min_UP threshold are selected as socially important locations for the user group (SIL).
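The group-level step can be sketched in the same way. The function name `discover_sil`, the dictionary layout mapping each user to a set of SILU labels, and the use of greater than or equal to for the min_UP check are assumptions for illustration only.

```python
from collections import Counter


def discover_sil(silu_by_user, min_up):
    """Return {location: user_prevalence} for locations meeting min_up.

    `silu_by_user` maps each user id to that user's set of SILU labels
    (an assumed layout). The denominator is the number of users in the map,
    including users whose SILU set is empty.
    """
    total_users = len(silu_by_user)
    if total_users == 0:
        return {}
    # Count, for each location, how many users have it as a SILU.
    counts = Counter(
        location for silu in silu_by_user.values() for location in set(silu)
    )
    return {
        location: count / total_users
        for location, count in counts.items()
        if count / total_users >= min_up
    }
```

For example, if three of four users have location "A" in their SILU sets, its user prevalence is 0.75 and it is kept for any min_UP of 0.75 or lower.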

SS-ILM Algorithm
The SS-ILM algorithm first discovers user-level socially important locations (SILU) and then, from the SILU lists, discovers social group-level socially important locations. The steps of the SS-ILM algorithm are presented in Figure 2.

Figure 2. Steps of SS-ILM algorithm
As can be seen in Figure 2, the dataset is first pre-processed and the locations of each user are extracted. Then, using these locations, user-level socially important locations mining is performed and the socially important locations for users (SILU) are discovered. Finally, social group-level socially important locations mining is performed and the group-level socially important locations (SIL) are discovered.

Modelling Fast SS-ILM Algorithm
In this section, we introduce two new definitions, occurrence count and candidate socially important location for user, to model our proposed algorithm, Fast SS-ILM. Based on these definitions, the SS-ILM algorithm (Dokuz and Celik, 2017) is modified.

PROPOSED FAST SS-ILM ALGORITHM
The basic SS-ILM algorithm discovers the socially important locations of social media users by calculating the location density and visit lifetime values of every location that the users visit. However, some locations are visited rarely, i.e., only once or twice. A location visited only once cannot become a socially important location, since it is not frequent. If such locations are pruned early, before the location density and visit lifetime interest measures are calculated, the execution time of the algorithm decreases.
The Fast SS-ILM algorithm aims to prune non-frequent locations before calculating location density and visit lifetime values, and thus decreases the execution time of the socially important location mining process. It first checks whether each location satisfies the minimum occurrence threshold for being a candidate socially important location; only the locations that satisfy the min_occurrence threshold are analyzed further as possible socially important locations for users. Algorithm 1 presents the proposed Fast SS-ILM algorithm.
As can be seen in Algorithm 1, only steps 5, 6 and 14 are added to the classical SS-ILM algorithm (Dokuz and Celik, 2017). At step 5, the calculate-occurrence function calculates the occurrence count of location l for user u. If the occurrence count of location l satisfies the min_occurrence threshold, the location becomes a candidate socially important location, meaning that it has been visited enough times to possibly be a socially important location for the user. In that case, the location density and visit lifetime values of the location are calculated to check whether it is actually a socially important location for the user. Otherwise, the location is pruned and its location density and visit lifetime values are not calculated. By pruning non-frequent locations early, unnecessary calculations of location density and visit lifetime values are avoided.
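The early-pruning idea can be illustrated with the following Python sketch of the user-level step. It is not a transcription of Algorithm 1: the function name `fast_silu`, the (timestamp, location label) visit layout, and the use of greater than or equal to for all three thresholds are assumptions made here. The point it shows is that the min_occurrence check runs before the two interest measures are computed.

```python
from collections import defaultdict
from datetime import datetime


def fast_silu(visits, min_occurrence, min_density, min_visit):
    """Early-pruning sketch for one user's (timestamp, location) visits."""
    if not visits:
        return set()
    times_by_location = defaultdict(list)
    for timestamp, location in visits:
        times_by_location[location].append(timestamp)

    total = len(visits)
    all_times = [timestamp for timestamp, _ in visits]
    history_span = (max(all_times) - min(all_times)).total_seconds()

    silu = set()
    for location, times in times_by_location.items():
        # Pruning step: a location below min_occurrence is not a candidate,
        # so its interest measures are never computed.
        if len(times) < min_occurrence:
            continue
        density = len(times) / total
        visit_span = (max(times) - min(times)).total_seconds()
        lifetime = visit_span / history_span if history_span > 0 else 0.0
        if density >= min_density and lifetime >= min_visit:
            silu.add(location)
    return silu
```

With min_occurrence set to 1 nothing is pruned and the sketch reduces to the unpruned user-level computation; larger values trade a smaller candidate set for less work per user.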

EXPERIMENTAL EVALUATION
In this section, the dataset is described first, then the pre-processing steps applied to the dataset are discussed, and finally the experiments are presented. The experimental setup is given in Figure 3. The experiments were conducted on an Intel Core i7 CPU at 3.40 GHz with 8 GB of RAM.

The Dataset
Social media networks provide developers with Application Programming Interfaces (APIs) (Twitter, 2017) to collect data from their servers. In this study, Twitter is used as the social media network and geographical Twitter data is collected. To collect data from the Twitter servers, the REST API and the Streaming API were used. In addition, the open source Twitter4j Java library (Yamamoto, 2017) was used to collect data from Twitter programmatically. The Streaming API supports geographical boundary search on streaming tweets. To create a dataset of physically related social media users, a geographical search bounded to Istanbul, Turkey was performed and the users were then collected.
Approximately 2500 users were collected in this step. Then, the REST API was used to collect all tweets of these users. Three parameters were collected for each tweet: date/time, latitude, and longitude. The dataset in this study is the dataset from Dokuz and Celik (2017).

Pre-processing Steps
In this section, data cleaning, user selection, temporal overweighting prevention, and location labeling procedures are explained.

Data Cleaning:
In the experiments, we used the data from active Twitter users. In this study, active Twitter users are defined as users who have sent at least 50 tweets. If the number of tweets of a user is low, the user is either passive or new. In the dataset, a proportion of the Twitter users are spam users. To exclude spam users, we used two criteria: followers count and follower/friends ratio. If a user's followers count is less than 10 and the follower/friends ratio is below 0.1, the user is labeled as a spam user and is not included in the dataset. The values of these parameters are chosen in line with the spam user detection literature; detailed information can be found in Benevenuto et al. (2010) and Zheng et al. (2015).
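The two filters described above can be written as simple predicates. The sketch below is illustrative only: the function names and argument names are hypothetical, and treating a user with zero friends as having an infinitely large follower/friends ratio is an assumption, since the text does not cover that case.

```python
def is_active_user(tweet_count, min_tweets=50):
    """Active users have sent at least `min_tweets` tweets (50 in the text)."""
    return tweet_count >= min_tweets


def is_spam_user(followers_count, friends_count):
    """Spam rule from the text: fewer than 10 followers AND ratio below 0.1."""
    # Assumption: with zero friends the ratio is treated as infinite,
    # so such a user is never flagged by this rule.
    if friends_count > 0:
        ratio = followers_count / friends_count
    else:
        ratio = float("inf")
    return followers_count < 10 and ratio < 0.1


def keep_user(tweet_count, followers_count, friends_count):
    """Keep users that are active and not labeled as spam."""
    return is_active_user(tweet_count) and not is_spam_user(
        followers_count, friends_count
    )
```

Because both spam conditions must hold, a user with 5 followers and 20 friends (ratio 0.25) is kept, while a user with 5 followers and 400 friends (ratio 0.0125) is dropped.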

User Selection:
The aim of this study is to compare the Fast SS-ILM algorithm with the SS-ILM algorithm, so the dataset and the users should be the same. For this purpose, the user selection approach of the SS-ILM algorithm is applied in this study. The details of the user selection approach can be found in Dokuz and Celik (2017).

Preventing Temporal Overweighting:
When we analyzed the dataset, we found that users may tweet (i.e., conduct social media activity) more than once at a location within a short time. If this behaviour is common, a location may appear to have more presence than it actually has, because the user was at that place once but tweeted several times. We call this problem temporal overweighting of a location. It is sometimes unintentional, for example when a user has a conversation with friends via tweets and tweets several times within a short time span. To prevent temporal overweighting of a location, we defined a threshold of 60 minutes. If a user tweets more than once at the same location within 60 minutes, we assume that the location information is not new and these tweets are counted as one. With this approach, temporal overweighting of a location is prevented. However, different approaches or criteria could also be applied to prevent temporal overweighting.
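One way to apply the 60-minute rule is sketched below. The text does not say whether the window is measured from the first tweet of a burst or from the previous tweet; this sketch assumes it is measured from the last tweet that was kept for the location, and it keeps a tweet that arrives exactly 60 minutes later. The function name and the (timestamp, location label) layout are hypothetical.

```python
from datetime import datetime, timedelta


def collapse_repeated_tweets(visits, window=timedelta(minutes=60)):
    """Keep at most one visit per location per time window.

    `visits` is a list of (timestamp, location_label) pairs. A tweet is
    dropped when it falls within `window` of the last *kept* tweet at the
    same location (an assumption about how the window is anchored).
    """
    last_kept = {}
    kept = []
    for timestamp, location in sorted(visits):
        previous = last_kept.get(location)
        if previous is not None and timestamp - previous < window:
            continue  # same place, still inside the window: not a new visit
        last_kept[location] = timestamp
        kept.append((timestamp, location))
    return kept
```

Anchoring the window at the last kept tweet means a long, continuous burst at one place contributes one visit per hour rather than collapsing entirely into a single visit; anchoring at the previous tweet instead would be an equally plausible reading.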

Location Labeling:
The Twitter APIs provide accurate latitude-longitude pairs for user tweets. This is beneficial for obtaining fine-grained results, but it is also a problem for location labeling. For example, a shopping mall or a stadium might occupy an area of 1 km², and many distinct locations could be defined for it because the exact latitude-longitude pairs do not match. To overcome this problem, we defined a distance threshold below which different latitude-longitude pairs are treated as the same location. Following Pavan et al. (2015), we set this threshold to 100 m. If two locations are closer than 100 m, the same label is assigned to both.
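A possible way to implement the 100 m rule is sketched below, using the haversine great-circle distance. The labeling strategy is an assumption: each point is compared with the first point seen for every existing label and reuses the first label within the threshold, otherwise it starts a new label. The text does not specify how labels are assigned when distances chain across several points, so other strategies (for example, spatial clustering) could give different labels.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude-longitude pairs."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def label_locations(points, threshold_m=100.0):
    """Assign an integer label to each (lat, lon) point.

    Greedy assumption: a point reuses the label of the first representative
    point closer than `threshold_m`; otherwise it becomes a new representative.
    """
    representatives = []
    labels = []
    for lat, lon in points:
        for label, (rep_lat, rep_lon) in enumerate(representatives):
            if haversine_m(lat, lon, rep_lat, rep_lon) < threshold_m:
                labels.append(label)
                break
        else:
            representatives.append((lat, lon))
            labels.append(len(representatives) - 1)
    return labels
```

The greedy scan is quadratic in the number of distinct representatives; for large datasets a spatial index or grid would be the natural replacement.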

Experimental Results
In this section, experiments are presented to evaluate the performance of the proposed Fast SS-ILM algorithm and the classical SS-ILM algorithm (Dokuz and Celik, 2017). As can be seen in Figure 4, the runtime of both algorithms increases as the number of users increases. The Fast SS-ILM algorithm consumes less time than the SS-ILM algorithm as the number of users grows, so it handles an increasing number of users more efficiently.

Effect of min_occurrence Threshold:
In this experiment, we evaluated the effect of the min_occurrence threshold on the runtime of both algorithms. The values of min_density, min_visit, and the number of users are set to 0.01, 0.05, and 1000, respectively. We increased the min_occurrence threshold from 1 to 5 in steps of 1. The effect of the min_occurrence threshold is shown in Figure 5.
As can be seen in Figure 5, the runtime of the SS-ILM algorithm stays constant as the min_occurrence threshold increases, whereas the runtime of the proposed Fast SS-ILM algorithm decreases. The proposed Fast SS-ILM algorithm outperforms the classical SS-ILM algorithm, decreasing runtime by up to 20%.
As can be seen in Figure 6, red pins show the results of the SS-ILM algorithm, and one additional blue pin shows the extra location, 104, that appears for a min_occurrence value of 3. The dropped location is circled and overlined in red. Only one additional location is added to the top 10 locations and one location is dropped.
Based on the figure, only one of the top 10 locations changes and the other locations remain the same.
As can be seen in Table 2, below a min_occurrence value of 3 there is no change in the discovered socially important locations.
For min_occurrence values of 3 and more, the order of the locations changes. The reason is that a value of 3 is already enough to affect which candidate socially important locations are discovered for users with a relatively small social media history, and thus min_occurrence values of 3 and greater change the results of the Fast SS-ILM algorithm. However, the top locations remain unchanged.

CONCLUSIONS AND FUTURE WORK
In this study, we proposed the Fast SS-ILM algorithm to discover socially important locations in a computationally efficient manner. The proposed algorithm is based on the SS-ILM algorithm proposed by Dokuz and Celik (2017). The Fast SS-ILM algorithm prunes non-frequently visited locations as early as possible, and thus the number of candidate socially important locations decreases significantly. With fewer candidate socially important locations, fewer calculations of the location density and visit lifetime measures are needed, and so the execution time of the socially important locations discovery process decreases. Experimental results showed that the proposed Fast SS-ILM algorithm outperformed the classical SS-ILM algorithm.
As future work, the Fast SS-ILM algorithm could be applied to big datasets and to other application domains of social media mining, such as location recommendation for social media users.

Figure 1. Visited locations of a social media user
Definition 3.3.1. Given a location l and a social media user u, the occurrence count of u at l is the number of occurrences of u at location l.
Definition 3.3.2. Given a location l and a social media user u, the location l is a candidate socially important location for user u if the occurrence count of user u at location l satisfies the min_occurrence threshold.
Figure 3. Experimental setup

Effect of the Number of Users:
In this experiment, we evaluated the effect of the number of users on the runtime of the Fast SS-ILM and SS-ILM algorithms. The values of min_occurrence, min_density, and min_visit are set to 3, 0.01, and 0.05, respectively. We increased the number of users from 200 to 1000 in steps of 200. The effect of the number of users is shown in Figure 4.

Figure 4. Effect of the number of users

Figure 6. Socially important locations that are discovered by algorithms

Table 2. Top 10 socially important locations for algorithms and for min_occurrence threshold values of 1 to 5