Performance analysis of classification methods for indoor localization in VLC networks

Indoor localization has gained considerable attention over the past decade because of the emergence of numerous location-aware services. Many research works have addressed this problem using wireless networks. Nevertheless, there is still much room for improvement in the quality of the proposed classification models. In recent years, the emergence of Visible Light Communication (VLC) has brought a brand new approach to high-quality indoor positioning. Among its advantages, this new technology is immune to electromagnetic interference and exhibits a smaller variance of received signal power than RF-based technologies. In this paper, a performance analysis of seventeen machine learning classifiers for indoor localization in VLC networks is carried out. The analysis is performed in terms of accuracy, average distance error, computational cost, training size, precision and recall measurements. Results show that most of the classifiers achieve an accuracy above 90%. The best tested classifier yielded a 99.0% accuracy, with an average error distance of 0.3 centimetres.


INTRODUCTION
Indoor localization has been a topic of growing interest over the past decade as lightweight mobile devices have become the standard in the real world. Many user applications for these devices need some notion of the current position, and hence the development of localization techniques is one of the keys to the success of pervasive computing. Thus, location-aware services have made it possible to use applications capable of sensing their location and modifying their settings and functions accordingly (Want, 2001).
Many indoor localization approaches based on globally deployed radiofrequency communication systems, such as WLAN, Bluetooth and UWB, have been proposed, mainly because of their low cost and mature standardization state. In these systems, the fingerprinting technique is one of the most commonly used for indoor localization (Honkavirta, 2009). This kind of technique estimates the position by matching online measured data with pre-measured location-related data, such as received signal strength (RSS). Hence, only RSS information is needed and extra sensors are unnecessary. Localization based on fingerprinting is usually carried out in two phases. In the first phase, normally termed the offline phase, a database of RSS samples is built from different base stations at each reference location of the target environment. Using those samples as a training set, a positioning model is learnt using a particular machine learning technique. A great diversity of methodologies has been applied in this phase. During the second phase, namely the online phase, the location is determined from new RSS measurements collected at a specific position, using the model learnt in the previous phase.
A major drawback of fingerprinting techniques is that the key parameter (RSS) for predicting the position of a device is not stable over time due to dense indoor multipath effects, such as reflection, diffraction and scattering. Multipath fading causes the received signal to fluctuate around a mean value at a particular location (Kaemarungsi, 2012). Therefore, these techniques usually deliver an accuracy of up to two metres, since they are hindered by multipath propagation.
On the other hand, VLC is experiencing a growing interest because of improvements in solid state lighting and a high demand for wireless communications. Although line of sight is necessary for efficient communications in VLC networks, this kind of network infrastructure can offer a higher positioning accuracy, mainly for two reasons: it is not affected by electromagnetic interference, and the received optical power is more stable than radio signals, so it can be accurately determined (Armstrong, 2013). Therefore, fingerprinting techniques are expected to yield higher accuracy in VLC networks. In this paper, a performance analysis of different machine learning classifiers using RSS samples is carried out. RSS values are obtained using a VLC simulator that implements the IEEE 802.15.7 standard. To be precise, six classifiers are studied, namely K-Nearest Neighbour, Random Forest, C4.5, REPTree, KStar and LMT. Furthermore, Boosting and Bagging techniques are also analysed using the previous classifiers as "weak learners". Hence, seventeen classifiers are analysed in this paper. The analysis is carried out in terms of accuracy, average distance error, computational cost, training size, precision and recall measurements.
Within the last few years, many studies on VLC-based positioning have been published. Nevertheless, to the best of our knowledge, to date there are no published indoor positioning research papers in which a performance analysis of different classifiers is carried out.
The rest of the paper is organized as follows. In Section 2, we describe our simulator that implements the IEEE 802.15.7 standard for VLC networks. Next, in Section 3, the machine learning classifiers used in this paper are briefly described. In Section 4, the test environment is defined. In Section 5, the evaluation of specific parameters on the performance of the classifiers is shown. Finally, the conclusions are summed up and future works are presented.

SIMULATOR DESCRIPTION
In our research group, we have developed a simulator for IEEE 802.15.7 networks. It was developed using the OMNeT++ (Omnet, 2009) simulation framework, starting from the model developed by (Chen, 2008) for sensor networks based on the IEEE 802.15.4 standard, because of the similarities between the IEEE 802.15.7 and IEEE 802.15.4 architectures.
OMNeT++ provides built-in support tools not only for simulation, but also for the analysis and visualization of results. Several data types can be used to analyse simulation results, such as throughput, delay, packet loss and RSS. In this paper, the simulator is used to obtain RSS samples in a receiver grid, acquired from the signals coming from different emitters (also called coordinators). These RSS samples are used as features for training the classification methods.
The developed simulation model has been designed with the following premises:
- The IEEE 802.15.7 star topology has been chosen because of its importance and wide range of applications.
- For the MAC layer, we opted to use the superframe structure, since it allows the use of both contention (CAP) and contention-free (CFP) access methods. In addition, the use of the superframe enables devices to enter the energy-saving state during the idle period.
- A VPAN identifier is assigned to each emitter to identify each coordinator (LED lamp).
In the next subsections, the most important features of our simulator are described, for a better comprehension of the presented results.

Optical channel model
The transmission medium is modelled as free space without obstacles. The directed line of sight (LOS) link configuration was chosen to model the optical signal propagation, requiring a LOS between each device and the coordinator. Only the direct component of the received signal was considered to calculate the received power, neglecting the possible influence of reflections.
The frequency response of the optical channel is relatively flat near Direct Current (DC), so the most important quantity for characterizing this channel is the DC gain H(0) (Kahn, 1997), which relates the transmitted and received average optical power, see Equation 1:

$$P_r = H(0)\, P_t \qquad (1)$$

where P_t and P_r are the average transmitted and received optical power, respectively. In VLC, the received power can be expressed as the sum of LOS and non-LOS components. In directed LOS links, the DC gain can be computed accurately by considering only the direct LOS propagation path. According to the results presented in (Komine, 2004), at least 90% of the total received optical power is direct light in VLC when using a receiver field of view (FOV) of 60 degrees. Figure 1 shows an example of a directed LOS link.
An optical source can be modelled by its position vector, a unit-length orientation vector, a transmission power Pt and a radiation intensity pattern I(θ,m) emitted in direction θ, where m is the mode number of the radiation lobe, which specifies the directionality of the source and is related to the transmitter half-power angle θ1/2. Similarly, a receiver is defined by its position, orientation, photodetector area A and FOV (ψc). The angle formed between the incident optical signal and the orientation vector is called the incident angle ψ. The maximum incident angle defines the receiver FOV. Considering the LOS propagation path, the DC gain can be calculated according to (Kahn, 1997) as Equation 2: [...] transmission speeds, since the effects of multipath distortion on the optical signal are not considered. Considering only the direct component of the signal has the additional benefit of improving the efficiency of the implemented simulation model. The computational load required to run simulations of scenarios with multiple nodes, including the functionality of the different layers of the architecture, is reduced significantly.
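For reference, the LOS DC gain of a generalized Lambertian source is commonly written in the literature as shown below. It is given here as a sketch of what Equation 2 expresses, using the symbols defined in this subsection together with the concentrator gain G(ψ) and the optical filter transmission coefficient Ts(ψ). The transmitter-receiver distance d is a symbol introduced here as an assumption, since it is not defined in the text.

```latex
H(0) =
\begin{cases}
\dfrac{(m+1)\,A}{2\pi d^{2}}\,\cos^{m}(\theta)\,T_s(\psi)\,G(\psi)\,\cos(\psi), & 0 \le \psi \le \psi_c \\[2ex]
0, & \psi > \psi_c
\end{cases}
\qquad
m = -\frac{\ln 2}{\ln\!\left(\cos\theta_{1/2}\right)}
```

With a half-power angle θ1/2 of 60 degrees, cos θ1/2 = 0.5 and the mode number is m = 1.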
To ensure the validity of our implemented model, we have configured all optical receivers using a 60 degrees FOV value (ψc).

PHY layer simulation parameters
Table 1 shows the main configuration parameters of the PHY layer used in all simulation scenarios. We selected the PHY II operating mode, intended for both indoor and outdoor environments, using MCS-ID number 16, since support for the minimum clock and data rates for a given PHY is mandatory.
Because of the optical channel model used, the directivity of the transmitters is characterized by their half-power angle θ1/2, while the directivity of the receivers is defined by their FOV. According to (Chvojka, 2015), both parameters are assigned a value of 60 degrees to ensure the validity of the implemented channel model, since the calculation of the received optical power takes into account only the direct component of the signal.
In order to simplify the calculation process of the model, the values used for the concentrator gain (G(ψ)) and the transmission coefficient of the optical filter (Ts(ψ)) are set up as constant values, so they do not depend on the angle of incidence ψ.
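The received-power calculation with constant gains can be sketched as follows. This is a minimal illustration that assumes the standard Lambertian LOS expression from the literature, H(0) = (m+1)A/(2πd²) · cos^m(θ) · Ts · G · cos(ψ) inside the FOV and 0 outside; it is not the simulator's code. All function and parameter names are hypothetical, and the default gains of 1.0 are placeholders rather than the values of Table 1.

```python
import math


def lambertian_mode(half_power_angle_deg):
    """Mode number m of the radiation lobe for a given half-power angle."""
    return -math.log(2.0) / math.log(math.cos(math.radians(half_power_angle_deg)))


def _unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def los_dc_gain(tx_pos, tx_dir, rx_pos, rx_dir, area, fov_deg=60.0,
                half_power_angle_deg=60.0, filter_ts=1.0, concentrator_g=1.0):
    """Directed-LOS DC gain H(0); returns 0.0 outside the receiver FOV.

    filter_ts and concentrator_g are constants, i.e. independent of the
    incident angle (placeholder values, not taken from Table 1).
    """
    link = [r - t for r, t in zip(rx_pos, tx_pos)]   # transmitter -> receiver
    d = math.sqrt(sum(c * c for c in link))
    u = [c / d for c in link]
    cos_theta = sum(a * b for a, b in zip(_unit(tx_dir), u))   # emission angle
    cos_psi = -sum(a * b for a, b in zip(_unit(rx_dir), u))    # incident angle
    if cos_theta <= 0.0 or cos_psi < math.cos(math.radians(fov_deg)):
        return 0.0
    m = lambertian_mode(half_power_angle_deg)
    return ((m + 1.0) * area / (2.0 * math.pi * d * d)
            * cos_theta ** m * filter_ts * concentrator_g * cos_psi)


def received_power(p_t, gain):
    """Received average optical power: Pr = H(0) * Pt."""
    return gain * p_t
```

For a lamp on the ceiling pointing down and a receiver directly below it pointing up, both cosines equal 1 and the gain reduces to (m+1)A/(2πd²).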
The rest of the selected values employed to characterize VLC transmitters and receivers are commonly used values in literature, similar to those used in (Chvojka, 2015) (Tronghop, 2012).

MACHINE LEARNING CLASSIFIERS
In this section, the classifiers used in this paper are briefly described.

K-Nearest Neighbour
K-NN is a machine learning algorithm that predicts the class of new data based on the closest training samples in the feature space (Cover, 1967). The algorithm selects the K training samples with the smallest distance to the observation. Then, a simple majority vote among these neighbours determines the class prediction. In this paper, the IB1 implementation (Aha, 1991) of K-NN was used and the number of nearest neighbours was set to K=1 because this configuration provided better results.
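The neighbour search and majority vote described above can be sketched as follows. This is a minimal illustration with a plain Euclidean distance over RSS vectors, not the IB1 implementation used in the experiments; the function name and the sample values are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, x, k=1):
    """Predict the class of x by majority vote among its k nearest samples."""
    # Indices of the training samples sorted by Euclidean distance to x.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    # With k=1 this is simply the label of the closest fingerprint.
    return votes.most_common(1)[0][0]


# Hypothetical fingerprints: one RSS value per lamp, one label per receiver cell.
fingerprints = [[-40.0, -60.0], [-60.0, -40.0], [-41.0, -59.0]]
cells = ["cell_A", "cell_B", "cell_A"]
predicted = knn_predict(fingerprints, cells, [-42.0, -58.0], k=1)
```

With K=1 the classifier only needs to store the training samples, which is consistent with a very short model building time.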

Random Forest
Random Forest (RF) was proposed by Breiman (Breiman, 2001). Random Forest is an ensemble of decision trees such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. As the number of trees in the forest becomes large, the generalization error converges to a limit. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them.
The classification is done by a majority vote among the decisions of all trees. The randomness makes the algorithm robust against noise and outliers. Random Forest is equally applicable to both classification and regression problems. In this paper, the number of trees was set to 100.

C4.5
C4.5 is an algorithm used to generate a decision tree (Quinlan, 1993). C4.5 builds decision trees from a set of training data using the concept of information entropy. At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (difference in entropy). The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurses on the smaller sublists. In this paper, a confidence factor equal to 0.25 was used because it gave better results.
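The splitting criterion can be illustrated with a short sketch. It computes the entropy of a label set and the normalized information gain (gain ratio) of a binary split of a numeric attribute at a given threshold. It is a simplified illustration, not the C4.5 implementation used in the experiments, and the function names and sample values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Normalized information gain of splitting at `values <= threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # degenerate split, nothing is separated
    w_left, w_right = len(left) / len(labels), len(right) / len(labels)
    gain = entropy(labels) - w_left * entropy(left) - w_right * entropy(right)
    split_info = -(w_left * math.log2(w_left) + w_right * math.log2(w_right))
    return gain / split_info


# Hypothetical RSS values from one lamp and the cells they were measured in.
rss = [1.0, 2.0, 3.0, 4.0]
cell = ["A", "A", "B", "B"]
best = max((1.0, 2.0, 3.0), key=lambda t: gain_ratio(rss, cell, t))
```

In this toy case the threshold 2.0 separates the two cells perfectly, so it obtains the highest gain ratio and would be chosen for the node.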

REPTree
Reduced Error Pruning Tree (REPTree) is a fast decision tree learner (Srinivasan, 2014). It builds a decision/regression tree using information gain as the splitting criterion and prunes it using reduced error pruning. It only sorts values for numeric attributes once.

LMT
A Logistic Model Tree basically consists of a standard decision tree structure with logistic regression functions at the leaves (Landwehr, 2005), much like a model tree is a regression tree with regression functions at the leaves. It combines logistic regression models with tree induction, and is thus an analogue of model trees for classification problems. As in ordinary decision trees, a test on one of the attributes is associated with every inner node. For a nominal attribute with k values, the node has k child nodes, and instances are sorted down one of the k branches depending on their value of that attribute. For numeric attributes, the node has two child nodes and the test consists of comparing the attribute value to a threshold: an instance is sorted down the left branch if its value for that attribute is smaller than the threshold and sorted down the right branch otherwise.

KStar
K* is an instance-based classifier, that is, the class of a test instance is based upon the class of those training instances similar to it, as determined by some similarity function. It differs from other instance-based learners in that it uses an entropy-based distance function (Cleary, 1995).

Boosting
The boosting method is a technique to improve the classification accuracy of tree-based classifiers. The idea of boosting is to combine the predictions of many base or weak classifiers to form a powerful classifier. AdaBoost is the most popular boosting algorithm used for classification (Freund, 1996). It is an adaptive and iterative algorithm that combines base models of the same type, such as a C4.5 decision tree, in such a way that each new base model is influenced by the performance of the base models built in previous iterations. In this paper, we use all aforementioned classifiers as base models, except the K-NN classifier, because boosting is not possible with K=1. The algorithm was iterated 10 times.
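How each base model influences the next one can be sketched with a single reweighting round in the style of AdaBoost.M1 (Freund, 1996). This is a simplified illustration; whether it matches the exact variant and settings used in the experiments is an assumption, and the function name is hypothetical.

```python
import math


def adaboost_m1_round(weights, predictions, labels):
    """One AdaBoost.M1-style round: reweight instances and weight the model.

    Returns (new_weights, vote_weight), or None when the base model has a
    weighted training error of 0 or of at least 0.5, in which case the
    iteration cannot continue.
    """
    eps = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    if eps == 0.0 or eps >= 0.5:
        return None
    beta = eps / (1.0 - eps)
    # Correctly classified instances are down-weighted, so the next base
    # model concentrates on the instances that were misclassified.
    new_w = [w * beta if p == y else w
             for w, p, y in zip(weights, predictions, labels)]
    total = sum(new_w)
    return [w / total for w in new_w], math.log(1.0 / beta)


# Four training instances with uniform initial weights, one misclassified.
result = adaboost_m1_round([0.25] * 4, ["A", "A", "B", "A"], ["A", "A", "B", "B"])
```

A base learner that makes no errors on the training set yields an error of 0 and leaves nothing to reweight, which is one possible reading of the remark about K=1 above.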

Bagging
Bagging (Bootstrap Aggregating) was proposed by Breiman (Breiman, 1996) to improve classification by combining the classifications obtained from randomly generated training sets. Generating training sets in this way does not improve the predictive force of the model; it only decreases the variance, narrowly tuning the prediction to the expected outcome. It also helps to avoid overfitting.
Although it is usually applied to decision tree methods, it can be used with any type of method. In this paper, we use all aforementioned classifiers as base models. The algorithm was iterated 10 times.
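The procedure can be sketched as follows: each base model is trained on a bootstrap sample (drawn with replacement, same size as the original set) and the final class is the majority vote. This is a minimal illustration; `fit_base` is a hypothetical callable that takes a training set and returns a prediction function, and the default of 10 models mirrors the 10 iterations mentioned above.

```python
import random
from collections import Counter


def bootstrap_sample(x, y, rng):
    """Draw len(x) instances with replacement from the training set."""
    idx = [rng.randrange(len(x)) for _ in range(len(x))]
    return [x[i] for i in idx], [y[i] for i in idx]


def bagging_fit(x, y, fit_base, n_models=10, seed=0):
    """Train n_models base models, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [fit_base(*bootstrap_sample(x, y, rng)) for _ in range(n_models)]


def bagging_predict(models, sample):
    """Combine the base models by simple majority vote."""
    votes = Counter(model(sample) for model in models)
    return votes.most_common(1)[0][0]
```

Any of the base classifiers of this section could play the role of `fit_base`, as long as it returns a function mapping an RSS vector to a class label.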

TEST ENVIRONMENT
In this section, we describe the test environment used to evaluate the performance of the machine learning classifiers described in Section 3. The simulation environment configured in our IEEE 802.15.7 simulator models a 4 by 4 by 3 metres room. The scenario is shown in Figure 2. This environment consists of 16 coordinators or LED lamps (red triangles), arranged in a 4 x 4 grid on the ceiling and placed 1 metre apart from each other. On the lower part, we set up 100 receivers (blue circles) in a 10 x 10 grid configuration, with a separation of 36 centimetres from each other. In order to evaluate the effects of having different distances between receivers and coordinators, the receiver plane is set up at three different heights: 75, 100 and 125 centimetres from the floor. The receiver orientation was randomly assigned for each simulation as follows: receivers point towards the ceiling with an initial orientation vector [0,0,1], and a random offset in the range (-0.2, 0.2) is applied to each axis in each simulation. Thus, each receiver has a different orientation in each simulation.
Eleven simulations were performed on each of the three aforementioned receiver planes. One RSS measurement from each LED lamp was estimated at each receiver in every simulation. This leads to 3,300 (11 samples x 3 layers x 100 receivers) RSS measurements from each LED lamp. Hence, the dataset is finally composed of 3,300 instances, where each instance stores the RSS samples from each LED lamp estimated at a receiver. Figure 3 shows the received optical power (lux) at 1 metre from the floor with sixteen coordinators. It shows that there is enough lighting to receive the beacon frame at every reference location. The simulation parameters were specified in Table 1.
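The geometry of this scenario can be sketched as follows. The room size, grid sizes, spacings, heights and offset range come from the description above; centring both grids in the room and drawing the orientation offsets from a uniform distribution are assumptions, and all names are hypothetical.

```python
import random

ROOM = (4.0, 4.0, 3.0)  # width, depth and height in metres


def centred_grid(n_side, spacing, z, room=ROOM):
    """n_side x n_side grid at height z, centred in the room (assumption)."""
    x0 = (room[0] - (n_side - 1) * spacing) / 2.0
    y0 = (room[1] - (n_side - 1) * spacing) / 2.0
    return [(x0 + i * spacing, y0 + j * spacing, z)
            for i in range(n_side) for j in range(n_side)]


def random_orientation(rng, max_offset=0.2):
    """Upward vector [0, 0, 1] with a uniform random offset on each axis."""
    return tuple(c + rng.uniform(-max_offset, max_offset)
                 for c in (0.0, 0.0, 1.0))


# 16 lamps on the ceiling, 1 metre apart.
lamps = centred_grid(4, 1.0, ROOM[2])
# 100 receivers per plane, 36 cm apart, at three heights above the floor.
receiver_planes = {h: centred_grid(10, 0.36, h) for h in (0.75, 1.00, 1.25)}
# A fresh orientation is drawn for every receiver in every simulation.
rng = random.Random(42)
orientations = [random_orientation(rng) for _ in receiver_planes[1.00]]
```

Under the centring assumption the lamps sit at 0.5, 1.5, 2.5 and 3.5 metres along each axis, and the receiver grid spans 3.24 metres, leaving a margin of 0.38 metres to each wall.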

EXPERIMENTAL RESULTS AND DISCUSSION
In order to evaluate the performance of the machine learning classifiers on VLC networks, the WEKA machine learning tool (Hall, 2009) was used.
Weka is an open source collection of machine learning algorithms for data mining tasks, more specifically data pre-processing, clustering, classification, regression, visualization and feature selection.
Experiments focused on comparing the accuracy, error distance, computation time, training size, precision and recall measurements of the different classifiers. The error is the distance between the location assigned to a misclassified instance (estimated receiver) and its real location (real receiver). The error is calculated as the Euclidean distance between these points, and the arithmetic mean was computed from the results of the experiments. Since this is a classification problem, an error simply means that a receiver was estimated to be in a wrong positioning cell of the receiver grid.
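The two main metrics can be sketched as follows. The sketch averages the Euclidean error over all test instances, with correctly classified instances contributing zero. This averaging convention is an assumption, although it is consistent with the reported figures: 1% of instances misplaced by roughly one grid step of 36 centimetres gives a mean of about 0.36 centimetres. The names and coordinates are hypothetical.

```python
import math


def accuracy(true_cells, predicted_cells):
    """Fraction of instances assigned to the correct positioning cell."""
    hits = sum(t == p for t, p in zip(true_cells, predicted_cells))
    return hits / len(true_cells)


def mean_error_distance(true_cells, predicted_cells, cell_coords):
    """Mean Euclidean distance between real and estimated receiver cells."""
    dists = [math.dist(cell_coords[t], cell_coords[p])
             for t, p in zip(true_cells, predicted_cells)]
    return sum(dists) / len(dists)


# Two neighbouring cells on the same plane, 36 cm apart (coordinates in metres).
coords = {"r0": (0.0, 0.0, 1.0), "r1": (0.36, 0.0, 1.0)}
truth = ["r0", "r0", "r1", "r1"]
guess = ["r0", "r1", "r1", "r1"]
acc = accuracy(truth, guess)                      # one of four is wrong
err = mean_error_distance(truth, guess, coords)   # 0.36 m spread over four
```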
All experiments were carried out on an Intel Core i7 3.4 GHz/32 GB RAM non-dedicated Windows machine.

Accuracy of classifiers and computational cost
In this section, the performance of the classifiers is analysed. For the validity of the experimental results, the experiments were carried out using 10-fold cross-validation.
Table 2 shows the accuracy, the error distance and the time to build the analysed classifiers, that is, the training time. As can be seen, all classifiers have an accuracy above 90%, except the REPTree algorithm. The K-NN classifier obtains the best result, yielding a 99.0% accuracy with an average error distance of 0.3 centimetres, six times lower than that of the next best classifier, KStar. Furthermore, the K-NN classifier is the fastest to build.
On the other hand, Boosting and Bagging techniques outperform the base classifiers C4.5, REPTree and LMT, but at the expense of a much higher computation effort. Furthermore, Boosting techniques are slightly more accurate than Bagging techniques.
Figure 4 shows the cumulative distribution function (CDF) of the error distance for the best analysed classifiers, that is, K-NN, Random Forest, AdaBoost C4.5, AdaBoost REPTree, KStar and AdaBoost LMT.
As can be seen, most of the test instances are correctly classified, and most of the misclassified instances have an error of about 36 centimetres, that is, they are assigned to the nearest neighbours (receivers) of the exact locations at the same height. On the other hand, the maximum error of the K-NN and KStar classifiers is about 50 centimetres.
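The 10-fold cross-validation protocol can be sketched as follows: the instances are shuffled and divided into ten folds, and each fold is used once for testing while the remaining nine are used for training. This is a minimal illustration with a plain shuffle and no stratification; the function name and the seed are hypothetical.

```python
import random


def k_fold_splits(n_instances, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    idx = list(range(n_instances))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# With the 3,300 instances of the dataset, every fold tests 330 instances
# and trains on the remaining 2,970.
splits = list(k_fold_splits(3300, k=10))
```

The reported figures would then be averages over the ten folds.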
Table 2. Performance of classifiers.

Analysis of Training Dataset size
The training dataset size is an important parameter for the performance and the building time of each model based on decision trees. A large training dataset can provide better accuracy when predicting the correct location, but too much data can considerably increase the time needed to build the model. The aim is to shorten the training phase as much as possible with minimal impact on performance. In order to test the robustness of the method, different training dataset sizes were used, from 20% to 80% of the whole dataset. For the validity of the experimental results, the experiments were performed 100 times, each time selecting the training and testing data after randomizing the order of the instances and picking the same proportion of samples from each class (stratified split).
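The stratified split can be sketched as follows: the instances of each class are shuffled separately and the same fraction of each class goes to the training set. This is a minimal illustration; the function name, the rounding rule and the toy labels are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction, seed=0):
    """Return (train_indices, test_indices) with equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # randomize order per class
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test


# Toy example: 5 classes with 10 instances each, 60% used for training.
toy_labels = ["cell_%d" % (i % 5) for i in range(50)]
train_idx, test_idx = stratified_split(toy_labels, 0.6, seed=1)
```

Repeating the split with a different seed each time reproduces the idea of the 100 randomized runs described above.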

CONCLUSIONS
In this paper, we have analysed the performance of different machine learning classifiers for indoor localization in VLC networks. Accuracy, error distance, computational cost, training size, precision and recall measurements were evaluated.
Regarding accuracy, most of the analysed classifiers yielded excellent results, above 90% accuracy. This is mainly because visible light is less susceptible to multipath effects, making the propagation and the received optical power more predictable. The best classifier (K-NN) correctly classified 99.0% of the instances, with an average error distance of only 0.3 centimetres. This classifier was also the best performer in terms of precision and recall measurements, even for smaller training sets. In addition, its training time is the lowest, about 20 milliseconds. On the other hand, the accuracy, precision and recall measurements improve when the training dataset size increases, although this requires a higher computation effort. Furthermore, the error distance is less than 10 centimetres for all classifiers using only a 60% training dataset size. Hence, the results demonstrate that VLC networks may be used for indoor localization applications with high accuracy constraints.
Since the average error distance of misclassified instances cannot be less than the distance between receivers when classifiers are used, in our ongoing work we are planning to use other data mining techniques, such as regression, to reduce the error distance. Moreover, we are also planning to use principal component analysis to reduce the data dimensionality, so that the computation time to build the model could be reduced and the system accuracy could be improved.

Figure 1. Directed LOS link configuration.

Figure 2. Scenario with 16 LED lamps and 100 receivers.

Figure 3. Distribution of the received optical power at 1 metre from the floor.

Table 3. Performance with different training sizes.