NOVEL EVALUATION INDEX OF CROSS-SCALE DISCRETIZATION UNCERTAINTY BASED ON LOCAL STANDARD SCORE

Optimal discretization of continuously valued attributes is an uncertainty problem. The uncertainty of discretization is propagated and accumulated in the process of data mining, which has a direct influence on the usability and operation of the mining results. To address the limitations of existing discretization evaluation indices in descriptive accuracy and operation efficiency, this work suggests a discretization uncertainty index based on individuals. This method takes the local standard score as the general similarity measure within and between the intervals and evaluates discretization reliability according to the relative position of individuals in each interval. The experiment shows that the new evaluation index is consistent with commonly used metrics. Under the premise of guaranteeing the validity of discretization evaluation, the proposed method has greater descriptive accuracy and operation efficiency than extant approaches; it also has more advantages for massive data processing and special distribution detection.


INTRODUCTION
The discretization of continuously valued attributes is an essential step in data mining. Many existing data mining algorithms only handle discrete attributes, so continuous attributes must be discretized before these algorithms are applied. There is a plethora of research on evaluating the effect of discretization, and many classical evaluation coefficients are widely used. According to the object being evaluated, existing evaluations of discretization effects can be divided into two levels: global evaluation and individual-based evaluation. Global evaluation assesses the overall rationality of the discretization intervals. Among the commonly used coefficients, the Dunn validity coefficient proposed by Dunn et al. calculates the ratio of the maximum distance within an interval to the minimum distance between intervals; a superior discretization should have a large distance between intervals and strong cohesion within each interval. Chou et al. proposed the CH coefficient, which evaluates the discretization effect from the sum of the squared distances of the data within each interval and the sum of the squared distances between the interval centers and the center of the entire data set. Similarly, the I coefficient proposed by Davies and Bouldin evaluates discretization from the distances within and between intervals. However, a global coefficient yields only one value for a column of data, so it cannot describe or analyze the structural details of the discretization. Individual-based assessment can compensate for this deficiency. For example, Rousseeuw proposed the Silhouette Index, which calculates a coefficient value for each object in the data set and thereby evaluates how well each object is discretized.
However, for large data sets, the Silhouette Index is time-consuming because it requires repeated calculations on an object-by-object basis. With the rapid increase in data size and the growing complexity of data forms, there is an urgent need for new discretization assessment methods that adapt to massive data sets and complex data patterns.
Therefore, this paper proposes a new evaluation method for discretization uncertainty, a coefficient constructed from the local standard score. The method uses the standard score as the overall similarity measure of the data within and between intervals, and evaluates discretization reliability from the relative position of each individual in its interval.

The source of continuous attribute discretization uncertainty
In the process of discretizing continuous attributes, the distribution characteristics of the original data set change because the concept level of the attribute changes. Mapping continuous data to several discrete intervals loses the details of continuous change in complex data and reduces the amount of information the data contain, which results in uncertainty in the discretization results and in their application and analysis. The description of uncertainty in a discretization result consists of two parts: the discretization term set X and the probability distribution U(X) ∈ [0, 1] over this set. For a data set containing m continuous attributes, the number of data entries in the attribute value range is recorded as n. For any record, the discretization under each continuous attribute is represented by a discrete interval value for attribute i in record j, and Uij indicates its corresponding uncertainty.
Therefore, the uncertainty in discretization derives from the data itself and from the concept. Each record in the data set (each attribute value field) can be represented by its corresponding uncertainty probability. The uncertainty can be divided into two aspects: the degree of cohesion within the interval (reflecting how closely the objects in the interval are related) and the degree of separation between the intervals (reflecting how a certain interval differs from other intervals).
Cohesion and separation (resolution) in discretization are linked by a trade-off, so evaluating the effectiveness of a discretization result with cohesion or separation alone is unreliable. The more intervals there are, the smaller the error within each interval, while the discretization itself becomes less useful. Under extreme conditions, each object corresponds to its own subinterval, with itself as the center of the interval. In this case the error within the interval is 0 and there is no uncertainty, but this "ideal" discretization is in fact no discretization at all, which is not conducive to subsequent calculation and analysis. Therefore, discretization uncertainty assessment needs to combine cohesion and separation, finding a balance between the cohesion within the interval and the separation between the intervals in order to evaluate the discretization results.
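The extreme case described above can be illustrated numerically. The following is a minimal sketch, not the procedure used in this study: the data values, the equal-count splitting helper `split_sorted`, and the squared-error measure `within_error` are all assumptions introduced for illustration. It shows the within-interval error shrinking as the number of intervals grows and reaching 0 when every object has its own interval.

```python
from statistics import mean

def split_sorted(values, k):
    """Split the sorted values into k contiguous, roughly equal-count intervals.

    Hypothetical helper: an equal-count split stands in for any
    discretization algorithm. Requires 1 <= k <= len(values).
    """
    data = sorted(values)
    n = len(data)
    bounds = [i * n // k for i in range(k + 1)]
    return [data[bounds[i]:bounds[i + 1]] for i in range(k)]

def within_error(intervals):
    """Sum of squared deviations of each value from its interval center."""
    total = 0.0
    for interval in intervals:
        center = mean(interval)
        total += sum((x - center) ** 2 for x in interval)
    return total

# Hypothetical one-dimensional attribute values.
data = [1.0, 1.3, 1.9, 2.2, 4.0, 4.1, 4.8, 5.5, 8.0, 8.4, 9.1, 9.9]

errors = {k: within_error(split_sorted(data, k)) for k in (1, 2, 3, 4, 6, 12)}
for k, err in errors.items():
    print(f"k={k:2d}  within-interval error={err:.3f}")
```

With k equal to the number of objects the error is exactly 0, which is why cohesion alone cannot be used to pick the number of intervals.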

Discretization uncertainty coefficient
During an individual-based uncertainty assessment, the uncertainty of each object in the discretization result is calculated with respect to the discrete interval to which it belongs. The concept of the standard score is introduced in this study to evaluate the degree of cohesion within the interval and the degree of separation between intervals.
The standard score measures the deviation of an object from the mean in units of standard deviation. The original value of an object in a discrete interval is first compared with the mean of the interval, and the standard deviation is then used to judge where the object lies in the distribution of the interval as a whole. The standard score is therefore a quantification of the distribution level of an object within its discrete interval. Compared with the average distance, the standard score reflects the relative, standardized distance of the object from its own interval or from the neighboring interval, so the deviation of an individual object in the discretization can be evaluated better. For the individual-based uncertainty coefficient, the degree of cohesion of each object is measured by the standard score of its distance from the center of its own interval, and the degree of separation is measured by the standard score of its distance from the center of the nearest neighboring interval. The value of the coefficient is the ratio of the degree of cohesion to the degree of separation.
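As a small illustration of why the standard score differs from a raw distance, the sketch below scores the same object against a narrow and a wide interval. The two intervals, the helper name `standard_score`, and the use of the absolute value with the population standard deviation are assumptions made for this example only.

```python
from statistics import mean, pstdev

def standard_score(x, interval):
    """Absolute standard score of x relative to the values in `interval`.

    Assumption: the population standard deviation is used and the sign
    is dropped, since only the size of the deviation matters here.
    """
    mu = mean(interval)
    sigma = pstdev(interval)
    return abs(x - mu) / sigma

# Two hypothetical intervals with the same center (5.0) but different spread.
tight = [4.8, 4.9, 5.0, 5.1, 5.2]
wide = [3.0, 4.0, 5.0, 6.0, 7.0]

x = 5.2  # raw distance to the center is 0.2 in both cases
z_tight = standard_score(x, tight)
z_wide = standard_score(x, wide)
print(f"tight interval: {z_tight:.3f}   wide interval: {z_wide:.3f}")
```

The raw distance to the center is identical, yet the object sits at the edge of the narrow interval and near the middle of the wide one, which is the relative position the standard score is meant to capture.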
For a given continuous attribute $X = \{x_1, x_2, \ldots, x_n\}$, $n$ is the number of data objects in $X$. Using some discretization algorithm to divide $X$ into $k$ intervals, the discretization result is recorded as $I = \{I_1, I_2, \ldots, I_k\}$. The uncertainty coefficient $U_i$ of an object $x_i$ is the ratio of its degree of cohesion to its degree of separation, bounded above by 1:

$$U_i = \begin{cases} a_i / b_i, & a_i < b_i \\ 1, & a_i \ge b_i \end{cases}$$

where $a_i$ is the standard score of $x_i$ relative to its associated interval $I_j$, and $b_i$ is the standard score of $x_i$ relative to its nearest neighboring interval; where an interval has only one neighbor, only the single adjacent interval is taken. The uncertainty coefficient $UI_k$ for the $k$ discrete intervals is defined as follows:

$$UI_k = \frac{1}{n} \sum_{i=1}^{n} U_i$$

where $n$ is the number of objects in the data set and $k$ is the number of discrete intervals. $UI_k$ (the average uncertainty coefficient) is a measure of the overall uncertainty of the partitioning of the entire data set. The value of the uncertainty coefficient varies from 0 to 1. A value of 1 indicates that the dispersion of the object relative to its own interval is equal to or greater than its dispersion relative to the neighboring interval, and the uncertainty reaches its highest level. The closer the value is to 0, the smaller the uncertainty of the object.
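The definitions above can be sketched in a few lines of Python. This is an illustrative reading rather than the authors' implementation: the function names, the sample intervals, the use of the absolute standard score with the population standard deviation, and the choice of the "nearest" neighbor as the adjacent interval with the smaller standard score are all assumptions. At least two intervals are required.

```python
from statistics import mean, pstdev

def z_abs(x, values):
    """Absolute standard score of x relative to `values` (assumed form)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # Degenerate interval: every value equals the mean.
        return 0.0 if x == mu else float("inf")
    return abs(x - mu) / sigma

def uncertainty_coefficients(intervals):
    """Per-object uncertainty for intervals ordered along the value axis."""
    result = []
    for j, own in enumerate(intervals):
        # End intervals have a single adjacent interval, inner ones have two.
        adjacent = [intervals[m] for m in (j - 1, j + 1)
                    if 0 <= m < len(intervals)]
        for x in own:
            a = z_abs(x, own)                        # cohesion term
            b = min(z_abs(x, nb) for nb in adjacent)  # separation term
            result.append(1.0 if a >= b else a / b)
    return result

# Hypothetical discretization result with three well-separated intervals.
intervals = [[1.0, 1.2, 1.4], [5.0, 5.2, 5.4], [9.0, 9.2, 9.4]]
u = uncertainty_coefficients(intervals)
average_u = mean(u)  # average uncertainty coefficient over all n objects
print([round(v, 3) for v in u], round(average_u, 3))
```

In this sketch an object at the center of its interval receives a value near 0, while objects closer to an interval edge receive larger values.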

Comparison of discretization evaluation coefficients and characteristics of uncertain coefficients
Many results exist on the evaluation of discretization effects, together with classical evaluation coefficients. Although the starting points of these methods differ, they are all based on the overall similarity of the data within and between the subintervals of the discretization result. The distance between data represents the degree of difference between them. According to the discretization requirement, the discretization of a continuous attribute should make the distance between data in different intervals as large as possible and the distance between data in the same interval as small as possible. According to the object being evaluated, existing evaluations of the discretization effect can be divided into two levels: global evaluation and individual-based evaluation. The construction principles of the existing mature evaluation coefficients are compared with that of the uncertainty coefficient proposed in this study. The results are shown in Table 1 below.
Table 1. Comparison of discretization evaluation coefficients
Dunn coefficient (global). Cohesion: the range of the largest interval among all intervals. Separation: the smallest two-point distance between intervals.
CH coefficient (global). Cohesion: the sum of the squares of the distances between the points in the interval and the center of the interval. Separation: the sum of the squares of the distances between the center of each interval and the center of the data set.
I coefficient (global). Cohesion: the sum of the distances from the points in the interval to the center of the interval. Separation: the maximum distance between the centers of the intervals.

Compared with the existing discretization evaluation coefficients, the uncertainty coefficient proposed in this paper has the following characteristics.
(1) Consideration of each individual's contribution to uncertainty. The uncertainty coefficient is calculated in units of individuals, which fully reflects the heterogeneity of the discretized individuals. The average uncertainty coefficient is obtained from the discretization uncertainty of each individual. Compared with the traditional global evaluation coefficients, it significantly improves the granularity and flexibility with which the degree of discretization reliability is reflected. Therefore, the refined uncertainty distribution can be analyzed extensively.
(2) Efficiency improvement. Compared with the current individual-based Silhouette coefficient (the contour coefficient S), the uncertainty coefficient significantly improves operational efficiency. The contour coefficient needs to scan the data repeatedly during its calculation. The uncertainty coefficient, by contrast, uses the standard score as the measure of both cohesion and separation, which greatly reduces the time complexity of the calculation, thereby improving evaluation efficiency and better serving the analysis and application of massive data.

Effectiveness verification
To verify whether the uncertainty coefficient proposed in this study can effectively evaluate the reliability of discretization results, this paper compares the uncertainty coefficient with the existing evaluation coefficients in both global evaluation and individual evaluation. The experimental data include one set of simulated data and one set of actual data. The experimental hardware comprises an Intel Core i7 CPU at 3.60 GHz and 8 GB of memory. The operating system platform is Microsoft Windows 7 Ultimate, and the software programs are the Microsoft Visual C++ 6.0 compiler and Matlab R2014a.
The global evaluation is verified on simulated data containing 500 samples. Four range partitions were added to continuous, evenly distributed data to form five separate intervals, as shown in Figure 1. These data, whose discrete distribution is known, were discretized into 2-10 classes using the equal-width (EW), K-means and fuzzy c-means (FCM) algorithms. In addition to the uncertainty coefficient U, the global evaluation coefficients (Dunn coefficient, CH coefficient, I coefficient) are used to evaluate the discretization effect, so as to verify the validity of the uncertainty coefficient when applied to discretization evaluation.
The relevant data are shown in Table 2 (effectiveness evaluation of simulated data under three discretization algorithms). Considering the dimensional differences between the coefficients, the four series of values obtained by the four coefficients are normalized, so that the coefficients can be applied to the simulated data and the effects of discretization into 2-10 intervals under the different discretization methods can be evaluated and compared. Specifically, the minimum value in the series is subtracted from each calculated coefficient value, and the result is divided by the difference between the maximum value and the minimum value of the series.
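The normalization step described above is ordinary min-max scaling, and can be sketched as follows. The function name and the sample values are hypothetical and are not taken from Table 2.

```python
def min_max_normalize(series):
    """Scale a series to [0, 1]: subtract the minimum, divide by the range."""
    lo, hi = min(series), max(series)
    if hi == lo:
        # Constant series: the range is zero, so map everything to 0.
        return [0.0 for _ in series]
    return [(v - lo) / (hi - lo) for v in series]

# Hypothetical coefficient values for five candidate interval counts.
raw = [0.12, 0.30, 0.45, 0.90, 0.60]
norm = min_max_normalize(raw)
print([round(v, 3) for v in norm])
```

After scaling, each series has minimum 0 and maximum 1, so coefficients with different dimensions can be drawn on a common axis.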
To give the uncertainty coefficient U the same value trend as the existing coefficients, U is taken as 1-U, so that for all four coefficients a larger value indicates a more reasonable discretization. Figure 2 shows the comparison between the results of the four coefficient evaluations after normalization.
Since the metrics for cohesion and separation vary, the calculation results of the existing evaluation coefficients differ. The evaluation of discretization results can be divided into two aspects: detecting the optimal number of discrete intervals and comparing discretization methods. Taking Figure 2 (a) as an example, the EW method is used to discretize the simulated data into two to ten intervals. The four evaluation coefficients show that the discretization is most reasonable when the simulated data are divided into five discrete intervals of equal width. This is consistent with the known discrete distribution characteristics of the data. The uncertainty coefficient U proposed in this study is therefore consistent with the current classical coefficients in detecting the optimal number of discrete intervals.
By comparing the optimal number of intervals for the simulated data under the three discretization methods (EW, K-means, and FCM) and the values of the four evaluation coefficients, the following observations can be made.
(1) For the EW method and the FCM method, the four coefficients show that the best discretization effect is achieved with five discrete intervals, and the interval ranges are exactly the same; that is, the simulated data are divided into five equal parts of the same volume. K-means performs best when segmenting the simulated data into six intervals.
(2) Comparing the coefficient values of the three discretization methods under the optimal interval number, it can be seen that the values of the Dunn coefficient, the CH coefficient and the I coefficient of the K-means method are lower than those of the EW method and the FCM method, but the coefficient U is not determined. Figure 2 (b) shows that the coefficients in the corresponding graphs of the EW method and the FCM method reach the normalized maximum value 1 at the 5th interval. The K-means method reaches its series maximum at the 6th interval, but the values are less than 1, and the Dunn coefficient is as low as 0.03, so its discretization effect is worse than that of the other two methods. Therefore, the simulated data should be divided into five discrete intervals using the EW or the FCM method, which is consistent with the known data distribution characteristics.
Figure 2. Evaluation of discrete effect of simulated data
The experimental evaluation and comparison on simulated data with a known discrete distribution show that, both in detecting the optimal number of discrete intervals and in comparing discretization methods, the uncertainty coefficient U proposed in this study is consistent with the evaluation results of the current coefficients and with the actual distribution characteristics of the data. It is therefore an effective way to evaluate the results of discretization.

Individual evaluation
The individual evaluation verification uses the IRIS data set to compare the evaluation results of the existing discretization evaluation methods and of the uncertainty coefficient. The IRIS data, also known as the iris flower data set, are among the best-known data sets in data mining. The data set consists of 150 records with four continuous attributes: sepal length, sepal width, petal length, and petal width.
In the experiment, the K-means algorithm is applied to discretize each of the four continuous attributes in the IRIS data set into three categories for comparison. A comparison of the discretization evaluation coefficients (see section 2.3) reveals that only the contour coefficient and the uncertainty coefficient proposed in this study can evaluate the discretization uncertainty of each data object; the contour coefficient is therefore used to verify the validity of the proposed uncertainty coefficient in discretization evaluation. For each individual in the data set, the contour coefficient calculates the average distance to the other individuals in its own interval and the average distance to all individuals in the nearest adjacent interval, and combines the two to evaluate how well each individual is discretized.
The K-means algorithm is used to discretize the four attributes separately, the S and U coefficients of each individual are obtained, and the individual uncertainty evaluation is then conducted. Since the contour coefficient S and the uncertainty coefficient U run in opposite directions with respect to reliability, U is taken as 1-U to make it comparable with the current evaluation method, so that a larger value indicates a more reasonable discretization. The comparison of the similarity between S and U during the individual evaluation is shown in Figure 3. The abscissa indicates the index of the data record, and the ordinate indicates the calculated contour coefficient S and the uncertainty coefficient U proposed in this study.
For the four continuous attributes in the IRIS data set, the S and U coefficients both reflect the uncertainty of each data object in the discretization. Within each discrete interval, individuals in the middle of the interval have higher discretization reliability, and an individual's uncertainty increases as it approaches a segmentation point. Furthermore, the Pearson correlation between the S and U coefficient sequences is 0.8 or more for each attribute in the experimental data, which indicates a strong similarity. In individual-based discretization uncertainty assessment, the uncertainty coefficient U based on the local standard score and the existing contour coefficient S therefore have a highly similar distribution, and their evaluations of the discretization uncertainty of each individual are consistent.
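The similarity check described above relies on the Pearson correlation between the S sequence and the 1-U sequence. A minimal sketch of that calculation is given below; the function name and the two short sequences are hypothetical and are not the IRIS results.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical per-object values: contour coefficient S and uncertainty U.
s_values = [0.90, 0.80, 0.40, 0.85, 0.30]
u_values = [0.05, 0.15, 0.70, 0.10, 0.80]

# U is flipped to 1-U so that larger means more reliable, as for S.
reliability = [1.0 - u for u in u_values]
r = pearson(s_values, reliability)
print(f"Pearson correlation between S and 1-U: {r:.3f}")
```

Flipping U before correlating only changes the sign of the result, but it keeps the comparison in the same direction as the plots.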

Calculation efficiency
It can be seen from the experiment in section 3.1 that the proposed uncertainty coefficient U gives results that are basically consistent with those of the existing contour coefficient S when evaluating discretization reliability. Because the uncertainty coefficient method reduces the time complexity of the algorithm, it achieves higher processing efficiency on massive data. Taking simulated data as an example, normally distributed random numbers are generated in Matlab, and the amount of data is gradually increased from 100 to 40,000. Each data set is discretized into three intervals by the K-means algorithm. In Matlab, each simulated data set is evaluated with the contour coefficient S and the uncertainty coefficient U. The calculation time as the amount of data increases is shown in Table 3. It can be seen that, as the data volume increases, the uncertainty coefficient U proposed in this study has a significant time advantage, as shown in Figure 4, which greatly reduces the time consumption of individual evaluation and improves the efficiency of data processing.
Figure 4. Running-time experiment of Silhouette coefficient and uncertainty coefficient
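The difference in cost can be sketched with a small timing experiment. This is not the Matlab experiment reported here: the function names are hypothetical, an equal-count split of sorted data stands in for K-means, the data sizes are kept small, and the form of U (absolute standard scores, population standard deviation, adjacent interval with the smaller score) is an assumed reading of the definition. The pairwise contour calculation touches every pair of objects, while U only needs one mean and one standard deviation per interval.

```python
import random
import time
from statistics import mean, pstdev

def silhouette_naive(intervals):
    """Contour (Silhouette) value per object from pairwise distances, O(n^2)."""
    scores = []
    for j, own in enumerate(intervals):
        for idx, x in enumerate(own):
            # Mean distance to the other objects in the same interval.
            a = sum(abs(x - y) for i, y in enumerate(own) if i != idx) / (len(own) - 1)
            # Smallest mean distance to the objects of any other interval.
            b = min(sum(abs(x - y) for y in other) / len(other)
                    for m, other in enumerate(intervals) if m != j)
            scores.append((b - a) / max(a, b))
    return scores

def uncertainty_fast(intervals):
    """U per object from per-interval mean and standard deviation, O(n)."""
    stats = [(mean(iv), pstdev(iv)) for iv in intervals]
    out = []
    for j, own in enumerate(intervals):
        adjacent = [stats[m] for m in (j - 1, j + 1) if 0 <= m < len(stats)]
        mu, sigma = stats[j]
        for x in own:
            a = abs(x - mu) / sigma
            b = min(abs(x - m_mu) / m_sigma for m_mu, m_sigma in adjacent)
            out.append(1.0 if a >= b else a / b)
    return out

def make_intervals(n, seed=0):
    """Normally distributed data split into three equal-count intervals."""
    rng = random.Random(seed)
    data = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
    third = n // 3
    return [data[:third], data[third:2 * third], data[2 * third:]]

def time_call(fn, arg):
    start = time.perf_counter()
    fn(arg)
    return time.perf_counter() - start

for n in (150, 300, 600):
    iv = make_intervals(n)
    print(n, f"S: {time_call(silhouette_naive, iv):.4f}s",
          f"U: {time_call(uncertainty_fast, iv):.4f}s")
```

Absolute timings depend on the machine and language; the point of the sketch is only the different growth rates of the two calculations as n increases.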

Breakpoint recognition
Firstly, a random data set with 1000 entries is generated in Matlab2016. Two partition points are then set in the value domain to form three separate intervals, giving a simulated data set with a hierarchical sequence distribution. The K-means method is used to discretize this data set into low, medium and high, labeled 1 to 3, and the distribution is shown in Figure 5 (a). It can be seen that the 1st interval includes a portion with a range of 0-0.1 and a portion with a range of 0.4-0.64, so it spans a partition. The 2nd interval has a range of 0.64-1.15 and is evenly distributed within the interval. The 3rd interval has a range of 1.55-1.7, which forms a value domain fault with the 2nd interval.
The uncertainty of the simulated data is evaluated by the contour coefficient S and the uncertainty coefficient U, and the results are shown in Figure 5 (b) (U is taken as 1-U). It can be seen that the overall distributions of S and U are relatively consistent: the reliability is higher in the middle of each interval, and the uncertainty is higher closer to the boundary area.
Since the distribution of the 1st interval spans a value domain fault, the uncertainty coefficient U also shows a corresponding fault in reliability. The data in the value range 0-0.1 are more separated from the interval than the data on the other side of the fault, so their uncertainty is higher. The data on the right side of the partition, by contrast, make up the majority of the data in the interval and show high reliability. In the evaluation of U, the standard score is selected as the unit of the distance metric, and its value reflects the relative position of the individual in the interval, so it can better reflect the uncertainty distribution of data faults within the interval. The S evaluation uses the average distance as its metric, which produces a smoothed uncertainty evaluation, so the differentiation within the interval cannot be detected. Therefore, for special distributions with data faults, the uncertainty coefficient can better evaluate the deviation of individual objects in the discretization.

CONCLUSIONS
In this study, a new cross-scale discretization uncertainty coefficient constructed from the local standard score is proposed, and its measurement quality and computational efficiency are analyzed in controlled experiments. By considering each individual's combined contribution within and between discrete intervals, the distribution of discretization uncertainty is detected and evaluated. The experimental results show that its evaluation of discretization reliability is consistent with the existing commonly used evaluation coefficients. A comparison of the operation efficiency of the uncertainty coefficient and the existing evaluation coefficient on large-volume data shows that the uncertainty coefficient proposed in this study significantly shortens the calculation time, and that its calculation efficiency on massive data is better than that of the existing evaluation coefficient. In addition, the uncertainty coefficient helps to identify breakpoints and abrupt points in the data set, which makes it more suitable for the discretization evaluation of special distributions. Furthermore, the uncertainty coefficient constructed using the local standard score is normalized to the range 0 to 1, and such evaluation results directly support a unified comparative analysis of the discretization uncertainty of different types of attributes. Potentially, the statistics of the standard score distribution under the standard normal distribution N(0,1), on which the discretization uncertainty coefficient is built, could be used to develop a probabilistic theoretical analysis of the properties of statistical (estimated) quantities of the data.
Since discretization results are unlikely to form a one-to-one ideal mapping of the complex real world, all types of discretization bring some uncertainty. Furthermore, when the discretization results are applied to data mining, the uncertainty of the previous stage is propagated to the latter stage, and the result is the accumulation and propagation of uncertainty. Therefore, it is of great value to evaluate the individual uncertainty in discretization. Although some preliminary explorations on evaluating the individual uncertainty in discretization have been carried out in this study, there are still many areas for improvement in follow-up research. Discretization uncertainty assessment can be effectively applied to the selection of discretization algorithms and interval numbers for actual data.
Figure 5. Verification of superiority in description of uncertainty coefficient