A PARALLEL ALGORITHM FOR LOCAL POINT DENSITY INDEX COMPUTATION OF LARGE POINT CLOUDS

Point density is an important property that dictates the usability of a point cloud data set. This paper introduces an efficient, scalable, parallel algorithm for computing the local point density index, a sophisticated point cloud density metric. Computing the local point density index is non-trivial, because this computation involves a neighbour search that is required for each, individual point in the potentially large, input point cloud. Most existing algorithms and software are incapable of computing point density at scale. Therefore, the algorithm introduced in this paper aims to address both the needed computational efficiency and scalability for considering this factor in large, modern point clouds such as those collected in national or regional scans. The proposed algorithm is composed of two stages. In stage 1, a point-level, parallel processing step is performed to partition an unstructured input point cloud into partially overlapping, buffered tiles. A buffer is provided around each tile so that the data partitioning does not introduce spatial discontinuity into the final results. In stage 2, the buffered tiles are distributed to different processors for computing the local point density index in parallel. That tile-level parallel processing step is performed using a conventional algorithm with an R-tree data structure. While straight-forward, the proposed algorithm is efficient and particularly suitable for processing large point clouds. Experiments conducted using a 1.4 billion point data set acquired over part of Dublin, Ireland demonstrated an efficiency factor of up to 14.8/16. More specifically, the computational time was reduced by 14.8 times when the number of processes (i.e. executors) increased by 16 times. Computing the local point density index for the 1.4 billion point data set took just over 5 minutes with 16 executors and 8 cores per executor. The reduction in computational time was nearly 70 times compared to the 6 hours required without parallelism.


INTRODUCTION
The term point cloud is often used to indicate a set of discrete, three-dimensional (3D) sampling points that represent some spatial entities (e.g., urban infrastructure, building objects, terrain surfaces). Point clouds are usually acquired by Light Detection And Ranging (LiDAR) or photogrammetric techniquess. Point clouds are recognised as an increasingly prominent part of the spatial data infrastructure and a distinguishable kind of spatial data apart from raster and vector data (van Oosterom et al., 2015;Julin et al., 2018). An important property of a point cloud data set is its point density, which measures the number of sampling points per unit area. However, a summation of points on an apparently flat surface and the division of that number over the horizontal projection of the local surface provides an uninformative measure of density due to the high degree of heterogeneity in urban aerial point clouds and in many ways is only an indicator of the upper bound of horizontal coverage (Stanley and Laefer, 2021). Arguably, the division of the all points in the point cloud by the entire horizontal spatial extent is even less informative. Understanding a more localized nature of the density of a point cloud is important, as it has direct implications on its usability in downstream applications (Renslow, 2012). Specifically, insufficient densities can preclude tasks such as detection of objects and features smaller than a certain size. Thus, point density is a primary criterion in point cloud data acquisition and analysis (Heidemann, 2018). While there are various ways to compute the density of a point * Corresponding author cloud, such as (Lari and Habib, 2012;Bethel, 2019), computing the exact density at each location across a large point cloud data set is a computationally demanding problem, because of the need to consider the local neighbourhood of each point. Knowing the point density as a function of position with a data set can help inform the likely success of segmentation or object detection algorithms.
To address the issue, this paper introduces an efficient parallel algorithm for computing the Local Point Density index (LPD), a relatively sophisticated density metric. Compared to conventional density computation approaches such as two-dimensional (2D) gridding (Bethel, 2019), LPD is capable of capturing the density on horizontal, as well as non-horizontal, surfaces at the point level. The computation of LPD of a point cloud relies on analysing the local neighbourhood of each individual point in the point cloud. The parallel algorithm proposed in this paper aims to compute the LPD index of large point clouds in a timely manner. Firstly, this is done by enabling out-ofcore computation, thereby circumventing the need to load the entirety of the point cloud into the main memory, which removes memory capacity as the limiting factor with respect to data volume. Secondly, the algorithm is parallel, thereby allowing multiple processors to share the computing workload -leading to a significant reduction in computation time. The capability of the algorithm to compute LPD for large point clouds rapidly is particularly important to cope with the massive volumes of point cloud data being collected, which can be billions of points per square kilometer (e.g. AHN, 2014;Laefer et al., 2017;Vo et al., 2016).
One of the main challenges in developing the parallel algorithm is partitioning the data without interrupting the spatial continuity inherent to the data. The algorithm proposed in this paper is inspired by the tile-based parallelisation with the use of spatial buffers introduced by Li et al. (2017). Rather than replicating or simply adopting (Li et al., 2017), this research extends those concepts by introducing a parallel pre-processing step to structure the input point cloud into buffered tiles. In addition, the research provides an extended formalisation and in-depth analysis of the computational efficiency of the algorithm and rigorously evaluates the actual computational performance through large-scale experiments. In particular, the research makes the following contributions: • novel, point-level parallelisation approach to organise points into buffered tiles • extension of the tile-level parallelisation approach by Li et al. (2017) for computing the LPD index • analysis of the computational efficiency of the complete parallel algorithm for LPD computation • demonstration of the ability of the algorithm to compute the LPD index for a large point cloud rapidly by distributing the computation across a distributed memory system.

RELATED WORK
This section reviews research related to point density computation and parallel point cloud processing. This starts with the most simplistic such as that recommended by the US Geological Survey's LiDAR Base Specification (Heidemann, 2018), which involves computing the average density for a representative polygonal area larger than 1 km 2 . As noted by Bethel (2019), the simplistic representative area approach is insufficient to comprehensively represent the actual density of the entire point cloud, as the mechanism cannot capture local differences. More rigorous alternatives include the Voronoi/Thiessen polygon approach and the gridding approach (Naus, 2010;Bethel, 2019). Those approaches are better at capturing the local point density distribution compared to a representative area approach. However, all of the aforementioned approaches project the 3D data onto a 2D, horizontal plane, thereby obfuscating the actual point distribution, which is inherently 3D. That limitation particularly affects non-horizontal surfaces, which are rarely seen in topographic mapping, but commonly captured in terrestrial and mobile LiDAR,as well as low-altitude airborne LiDAR (Laefer et al., 2017(Laefer et al., , 2014. Aiming to address the limitation of the 2D approaches, Lari and Habib (2012) introduced three alternatives for computing the LPD in 3D. The first, referred to as the approximate approach, assumes that the local spherical neighbourhood around the point in question contains adjacent points that approximate a planar surface. The LPD index of a point is, thus, calculated as LPD = (n + 1)/(πr 2 ), where r is the radius of the spherical neighbourhood of the point in question, and n is the number of points in that neighbourhood. The approximate LPD index is available in notable point cloud software such as CloudCompare and is widely used in other research (e.g. Vo et al., 2015;Xu et al., 2018;Saglam et al., 2020;Ye et al., 2020). The other two approaches introduced by Lari and Habib (2012) aimed to verify the planarity assumption and exclude non-planar points in the spherical neighbourhood prior to the LPD computation.
Compared to the approximate approach, the accuracy of the latter approach is improved at the cost of a higher computational cost.
As LiDAR and photogrammetry technologies advance rapidly, the size and density of point clouds are rapidly increasing in both volume and density. A common approach to accelerate data analyses and accommodate large data volumes is parallel computing. Developing parallel point cloud data analysis algorithms is an active research topic. There is a large body of research work (e.g. Wu et al., 2011;Che and Olsen, 2018;Zhang et al., 2011;Bodenstein et al., 2015;Brédif et al., 2015; that has investigated both the main types of parallel systems (shared-memory and distributed-memory) for point cloud data analysis. In a shared memory system, all processors share access to the computer's memory. A parallel program that can exploit that type of system is called a multithreading program. In a distributed-memory system, each processor (called a node) has its private memory and functions similar to an autonomous computer. Individual computing nodes in a distributed-memory system exchange data through explicit communication over a network (e.g. message passing). Compared to shared-memory systems, distributed-memory systems are more difficult to program and are typically not as fast at computing due to significantly higher communication overheads. However, distributed-memory systems can be scaled much more easily and far less expensively (Kleppmann, 2017). Examples of research on multi-threading point cloud analysis algorithms include the parallel Delaunay triangulation algorithm by , the point cloud registration algorithm by Martinez et al. (2013), and the spatial segmentation algorithm by Che and Olsen (2018). The efficiency of multi-threading parallelism was demonstrated in all of those.
The use of distributed memory systems for point cloud data analysis has spawned significant quantities of research. For example, Zhang et al. (2011) and Bodenstein et al. (2015) independently introduced different spatial clustering algorithms that can exploit the high scalability of distributed memory systems. Both research groups followed the explicit parallel programming approach and used the Message Passing Interface (MPI) framework. With MPI, programmers must explicitly instruct how specific processors perform their tasks and coordinate with other processors. Major complexity such as avoiding race and deadlock must be handled by the programmer. Consequently, explicit parallel programming, such as MPI, is powerful method but complex. A simpler way to program distributed-memory systems is to employ frameworks such as MapReduce (Dean and Ghemawat, 2008). That framework abstracts away the actual complexity of parallel programming and provides a simple interface that programmers use to formulate their computational problems. Such an approach is simple and more accessible but usually less computationally efficient when compared to the explicit approach. There are two common implementations of MapReduce: Hadoop MapReduce (https://hadoop.apache.org/) and Spark (https://spark.apache.org/). Examples of Hadoop Map Reduce for point cloud data analysis can be seen in (Aljumaily et al., 2016) and (Krishnan et al., 2010). Its successor, Spark, is typically 10-100s times faster than Hadoop MapReduce due to its more efficient memory usage (Zecevic and Bonaci, 2017). In addition, Spark provides a richer interface so that applications need not follow the rigid structure of one map and one reduce function, as per the original MapReduce framework. The use of Spark for LiDAR point cloud analysis is seen in various research for a diverse range of purposes including orthographic image rendering (Brédif et al., 2015) and change detection (Liu et al., 2016) and solar radiation simulation with point clouds .
Tiling, rasterisation, and voxelisation are commonly seen in MapReduce and Spark applications for point cloud data analysis (e.g. Krishnan et al., 2010;Rizki et al., 2017;Liu et al., 2016;Li et al., 2017). These techniques group points into spatially coherent groups either in 2D (i.e. tiling, rasterisation) or in 3D (voxelisation). The groups of points (i.e. tiles, cells, or voxels) are then distributed to different processors for parallel processing. The key in applying such a strategy is in decoupling the overall computation into separate cell-or voxel-based computations without introducing spatial discontinuities or other artifacts in the ultimate results. Li et al. (2017) recommended a spatial buffer around each data partition to avoid the spatial discontinuity. The approach was demonstrated as general-purpose, as it could parallelise a class of point cloud processing tasks involving neighbouring information. Notably, the parallelisation introduced by Li et al. (2017) was at the tile level. The parallel processing was conducted on the basis of a cluster of tiles (called a subdomain), and the spatial buffer was set as the whole tile size; the input point clouds were assumed to be already structured in tiles prior to entering the parallel processing workflow. Decomposing an unorganised point cloud into tiles was not part of the algorithm. While techniques such as rasterisation and voxelisation, in addition to the use of spatial buffers, have successfully facilitated the parallelisation of several point cloud processing problems, there is not a known approach to parallelising and scaling out the point density computation to multiple nodes of a distributed-memory system. That is the topic of this paper.

METHODOLOGY
This research introduces a more effective algorithm to computing the local point density index based on data parallelism and is implemented in Apache Spark. The key operation in computing the local point density index for a point data set P is the 3D range search for each individual point in P . The search retrieves for each point pi ∈ P all neighbouring points pnj ∈ P that have d(pi, pnj) < r; where r is a user-defined radius and d(pi, pnj) is the 3D Euclidean distance between pi and pnj. Let n be the number of neighbours of pi, and r be the maximum distance between pi and all pnj. The LPD index of point pi is computed as LPD = (n + 1)/(πr 2 ). As P often consists of a large number of points (e.g. billions to hundreds of billions), the 3D range search is computationally expensive and dictates the overall computational cost of the point density determination. A conventional approach to the range search is to use a spatial data structure such as an R-tree (Guttman, 1984) or an octree (Samet, 2006). By using such a data structure, the time required to perform a range search is t search , which is proportional to the logarithm of N (i.e. the number of points in P ). Performing a range search for all points in P takes N · t search (log N ), referred to as T search in Equation 1. In addition to the 3D range search, as shown in Equation 1, the total cost of the conventional, serial approach (T serial ) consists of the costs for constructing the tree, Ttree = N · ttree(log N ) and computing the LPD, TLP D = N · tLP D . In Equation 1, ttree proportional to log N is the time required to insert one point into the R-tree, and tLP D , which is independent of the data size, is the time to compute the LPD index for one query point.
The algorithm is composed of two stages. During the first stage, the input point cloud, P , is partitioned into 2D buffered tiles that partially overlap each other (Figure 1). Each tile, τ , contains a rectangular core region, τ core , which does not overlap the core region of any other tile. Around the core region of each tile, a buffer, τ buffer , of size r is created to ensure that the data partitioning does not introduce any spatial discontinuity in the ultimate results (Figure 1a). In the second stage, as the point cloud has been partitioned into buffered tiles, the LPD index can be computed for each tile separately. buffered tiles 1 Function MapPointToTiles(pi): Both stages of the proposed algorithm are performed in parallel using multiple processors. In the first stage, each processor is given several groups of points (i.e. data partitions). The processor works on one partition at a time so that it does not have to load all partitions into the main memory. Such a solution is known as an out-of-core solution, which is suitable for large data sets that do not fit in a processor's memory. The processor maps each point pi in a partition to all buffered tiles that contain pi. That operation is performed using the MapPointToTiles function in Algorithm 1. As the tile cores are disjoint, there is only one tile, τ • j , that contains pi in its core. The index of τ • j , which is a 2-tuple index [X • j , Y • j ], is computed using the two simple rounding functions on lines 3 and 4 of Algorithm 1. In the functions, (x•, y•) is an arbitrary point defining the origin of the tiling grid and ∆x and ∆y are the dimensions of each tile. In addition to τ • j , pi can belong to several neighbours of τ • j , if the distance from the neighbours to pi is shorter than the radius r. Lines 8-12 of the algorithm check the neighbours of τ • j for those containing pi. Given that r is smaller than the tile size, a maximum of eight direct neighbours of τ • j have to be checked (Zlatanova et al., 2016). The number of tiles can be further reduced by considering position of pi relative to the centre of τ • j . For example, in Figure 1b, pi is to the North-East of the centre of τ • j . Thus only three neighbours (i.e. North, East, and North-East) of τ • j have to be checked using the tile to point distance condition. Nevertheless, the search for neighbouring tiles is always O(1), because the neighbouring tiles can be located using their 2D grid indices. For example, the 2D index of the North-East neighbour of tile τ Consequently, mapping one point to all of its tiles also takes O(1).
Apart from the point coordinates, all other inputs of the MapPoi-ntToTiles function are constant. Thus, as seen in the data lineage in Figure 2, multiple partitions of points can be processed completely and in parallel by different processors, which do not need to synchronise data with other processors. The parallelism in this stage is referred to as the point-level parallelism, as opposed to the tile-level parallelism in stage 2. Using k processors to map all N points in P to their tiles takes N k · t tiling , where t tiling is the time required to map one point to its tile and is independent of the total data size. The cost for tiling data makes up the first term (i.e. T tiling ) in the total cost of the parallel algorithm (Equation 2). Communication across processors happens at the subsequent step where the points are aggregated by the tiles to which they belong. In other words, the [τj, pi] pairs resulting from the previous map function are aggregated by the tile index τj(Xj, Yj). That aggregation is implemented using the built-in aggregateByKey function in Spark. While the aggregation is known to be highly optimised and is performed in parallel (Zecevic and Bonaci, 2017), the actual implementation is part of Apache Spark and is not managed by the authors. In this paper, the computational cost of that aggregation is treated as an unknown (Taggr in Equation 2). During stage 2 of the algorithm, the LPD index computation is performed for each separate buffered tile, τj using the Compute-LPD function in Algorithm 2. In the first step (lines 2-5 of Algorithm 2), a local R-tree is created for all points in τj to aid the 3D range search. Considering m as the number of tiles, the average time for k processors to construct all local R-trees for the point data set P is N k · ttree(log N m ), where N > N is the number of points including the duplication due to buffering. At the subsequent step (lines 7-11 of Algorithm 2), the range search is performed for every point in the core of τj using the corresponding local R-tree. The total time of the search is N k · t search (log N m ). The final step, LPD computation, takes N k · tLP D . As shown in Algorithm 2 and the data lineage in Figure 2, all operations in stage 2 are contained within each separate tile. A processor manipulating a tile only needs the point data within that tile, thereby circumventing the need to synchronise or exchange data with other processors. Notably, the computation in stage 2 is also out-of-core, since the processors only need to load the subset of tiles relevant to the current computation in the main memory. The total computational time of the parallel algorithm is shown in Equation 2. All of the terms composing T parallel are inversely proportional to k, except for the unknown Taggr . While the aggregation is not directly controlled, the efficiency of the operation, which relies on the aggregateByKey function in Spark, is observable from the relationship between reduction of the total cost when increasing k. If the aggregateByKey function is highly parallel or Taggr is insignificant compared to the total cost, T parallel should be inversely proportional to the number of processors, k.
Assuming that the differences between N and N , as well as log (N ) and log ( N m ) are negligible, the last three terms of T parallel approximate T serial /k. The first two terms are the overhead required to parallelise the computation. T parallel can be rewritten as in Equation 3. The speedup factor, T serial T parallel (Pacheco, 2011), is always smaller than k because T overhead is always larger than zero. The efficiency of the algorithm is measured by how close its speedup factor is with respect to to k.

RESULTS
The algorithm introduced in Section 3 was used to compute the 3D local point density index of a 1.4 billion point data set acquired over a part of the city centre of Dublin, Ireland in 2015 by Laefer et al. (2017). The LPD index results of the entire data set are shown in Figure 3a. The red polygon indicates the Defined Project Area (DPA), in which the minimum point density was established as part of the project specifications. The entire coverage, including the partially covered area, is called the Buffered Project Area (BPA). The DPA and BPA are established concepts defined in the US Geological Survey's LiDAR Base Specification (Heidemann, 2018). Figure 3b presents a closeup, 3D view of a small region around Dublin Castle. As expected, the 3D LPD index computation automatically adapts to the 3D surface orientations and captures the point density on various surfaces such as the ground, building roofs, and facades. Points reflected from trees and transient objects such as moving people appear as low density points in dark blue colours. This is an inherent feature of the 3D LPD approach, which is unavailable in 2D approaches.  The blue and dotted, orange histograms in Figure 4 represents the point density distributions in the BPA and the DPA, respectively. The proportion of lower density points in the BPA is higher than that in the DPA because the BPA includes partially covered areas. Each histogram has two peaks. The higher peak (around 280 points/m 2 ) represents the typical point density on horizontal surfaces such as the ground and building rooftops. Notably, the density value resulted from the 3D LPD approach is always lower than those computed by conventional 2D approaches because the area of a 3D inclined surface is always larger than the corresponding 2D projection of the area on the ground. The majority of points contributing to the second peak (around 30 points/m 2 ) are those captured on building facades and other vertical surfaces. The density histogram is an effective means to provide insight into the detailed point density distribution. For example, the River Liffey is clearly visible in Figure 3a as points captured on water surfaces are sparse due to specular reflection and refraction. Several flat grassy areas also clearly stand out in yellow due to their high point density.
In Figure 3b, the vast density difference between the roof and facade surfaces of Dublin Castle is obvious. The accuracy of the LPD computation can be straightforwardly evaluated using sampled hand calculations or reference solutions from software such as CloudCompare (Girardeau-Montaut et al., 2005). The remainder of this section focuses on evaluating the computational efficiency of the proposed parallel algorithm.
To evaluate the performance of the proposed algorithm, the runtime of the LPD index computation was measured with various numbers of processors. Notably, the concept of processors in Section 3 corresponds to two different concepts, executors and cores, in the implementation of Apache Spark. An executor is a software process residing on a physical node of the cluster and does not share the memory with other executors. An executor can execute multiple tasks using multi-threading. The maximum number of parallel tasks per executor is referred to as the number of cores. Importantly, the concept of core in Spark that is used in this paper does not refer to the physical CPU cores. The experiments were conducted using two different computing clusters. The specifications of the clusters are shown in Table 1. The first cluster from New York University (NYU) was an 18-node, high-end cluster with a high CPU core count (2×24 cores) and a large memory (384 GB). The second cluster from University of Genoa (UniGe) was a cluster of 8 commodity (consumer-grade) nodes, each had 1×6 CPU cores and 16 GB of memory. The purpose of experimenting with the two types of clusters is to understand any limitation of commodity clusters. As the proposed algorithms rely on the spatial structure of the data (i.e. tiles, buffers), the initial arrangement of the input data affects the computational performance. In all experiments reported in this paper, the input point clouds were structured as 500 × 500 tiles, and the points within each tile were ordered by their timestamps as they were delivered from the flight vendor (Laefer et al., 2017) All experiments were performed with r = 0.5m and δx = δy = 5m. The selection of the radius parameter, r, is dependent on the data characteristics and target applications. The tile size parameters, δx and δy, do not affect the density results and depend on the amount of memory allocated to the cores. A large tile size would overload the core memory, while a small tile size would increase the data redundancy due to buffering, increase communication costs, and, thus, increase the computing time. The experiments' runtimes were measured, as reported in Table 2 and Figure 5. The reported runtimes include the costs for data input and output from the Hadoop Distributed File System, in which the data resided. The runtimes were inversely proportional to the number of executors in all experiments. The only exception was the cessation of runtime reduction when increasing the number of executors from 8 to 16 in the UniGe cluster. As the UniGe cluster had 8 compute nodes, requesting more than 8 executors did not yield a further speedup. Another aspect related to the capacity of the UniGe cluster is that the cluster was configured to allow a maximum of 4 cores per executor, while up to 8 cores per executor could be requested from the NYU cluster. Nevertheless, when the clusters were not pushed beyond their capacity, T parallel reduced as k increased. The observed relationship between T parallel and k demonstrated that the unknown cost of data aggregation (Taggr in Equation 2) did not impede the efficiency of the proposed algorithm. With respect to the relative performance of the two clusters, the smaller, commodity UniGe cluster was surprisingly faster than the NYU cluster consistently in all of the tests, except for those beyond the cluster's capacity. This is probably attributable to the fact that during the time of the testing, the NYU cluster conducted over 75 other concurrent jobs. In contrast, the only job running in the UniGe cluster was the point density computation. Thus, resource sharing (e.g. disks, network bandwidth) is a likely factor contributing to the lower computational speed of the NYU cluster. The efficiency of the proposed algorithm was further analysed by computing the speedup factors, S = T serial /T parallel . The factor represents how much faster the parallel computation is with respect to the serial computation. In this research, an actual serial algorithm as described at the beginning of Section 3 was not implemented. Instead, a pseudo value of T serial was NYU Cluster UniGe Cluster Figure 5. Runtime used by running the parallel computation with 1 executor × 1 core. The speedup factors with various numbers of executors and cores are reported in Table 3 and Figure 6. Given 1 core per executor (i.e. the solid lines in Figure 6), the speedup was nearly linear. Namely, from the NYU cluster experiments, the speedup factor was 14.8 when the number of executors increased by 16 times. The efficiency factor, in this case, was E = S/k = 14.8/16 = 0.93. As E = 0.93 ≈ 1 indicates that the speedup of the parallel algorithm was nearly ideal at 16 executors × 1 core/executor. If the efficiency factor equals one, the computation is said to have a linear speedup. A linear speedup is usually very difficult to achieve in practice (Pacheco, 2011). As the number of cores per executor (hence, the total number of cores) increased, the efficiency reduced. The reduction in efficiency is seen in Figure 6a as the dotted and the dashed lines, which correspond to the experiments with 8 and 4 cores/executor. With 16 executors × 8 cores/executor, the efficiency was E = 69.2/(16 × 8) = 0.54. While the reduction in computational time by almost 70 times was great, the speedup was less significant, compared to the increase in the computing capacity of 16 × 8 = 128 times. The reduction in efficiency in relation to the increase in the number of processors (or cores) is well known and appears in almost every parallel program. The reason is that the presence of more processors translates to additional data transmission across the network. In addition, having more processors usually incurs a higher overhead for coordinating and synchronising the processors. With respect to the experiments conducted using the UniGe cluster, the speedup was almost linear in all cases when the cluster capacity was not exceeded. In the experiments that allocated 1 or 2 cores per executor, the speedup factors obtained from the UniGe cluster were similar to those from the NYU cluster. When the maximum of 4 cores were allocated per executor, the UniGe cluster's speedup was lower compared to that of the NYU cluster. For example, given 8 executors × 4 cores, the speedup factor in the NYU cluster was 25.7, while the UniGe cluster's factor was 20.6.

CONCLUSIONS
This paper presents an algorithm for computing the 3D local point density index for point clouds. The algorithm consists of two stages. In the first stage, points are aggregated into buffered tiles, and the computation is parallel at the point level. In the second stage, the tiles resulting from stage 1, are received and the LPD index is computed in parallel at the tile level. The algorithm is particularly suitable for processing large point clouds as it supports out-of-core computation and provides a high level of parallelism. An implementation of the algorithm was introduced using Apache Spark, a general-purpose distributed computing platform. The implementation can employ multiple nodes and cores of a distributed memory computing cluster to perform the LPD computation rapidly. Computing the LPD index for a data set of 1.4 billion points required just over 5 minutes with 128 cores (16 executors × 8 cores/executor). The efficiency of the algorithm was rigorously evaluated through two sets of experiments conducted using both a high-end cluster and a commodity cluster. While the larger, high-end cluster had a greater capacity, it was consistently, slightly slower than the smaller, commodity cluster, because the resources of the high-end cluster were shared with many other concurrent computations. In contrast, the commodity cluster was not shared during the experiments.
The efficiency of the parallel algorithm was successfully demonstrated in both sets of experiments. An efficiency factor of 14.8/16 was achieved when 16 executors × 1 core/executor were used. The efficiency reduced to 69.2/128 with 16 executors × 8 cores. The 3D LPD index computed by the algorithm is an effective means to evaluate and subsequently visualise the 3D distribution of points in a point cloud. The LPD index allows discerning the point density on surfaces that have different orientations such as building facades, roofs, and ground surfaces. While the algorithm was demonstrated using an aerial laser scanning point cloud, the algorithm is applicable to any kind of point cloud, including aerial, terrestrial, and mobile laser scanning data, as well as point clouds derived from photogrammetry data. In addition, the proposed algorithm could be straightforwardly extended to compute other local geometric features of a point cloud such as normal vector, surface roughness, and curvature. Future research should investigate those possibilities.