A COMPARISON BETWEEN THE HADOOP AND SPARK DISTRIBUTED FRAMEWORKS IN THE CONTEXT OF REGION-GROWING SEGMENTATION OF REMOTE SENSING IMAGES
Keywords: Remote Sensing, Image Segmentation, Distributed Processing, MapReduce, Hadoop, Spark
Abstract. This work follows a line of research dedicated to the parallelization of image segmentation algorithms in distributed computing environments, motivated by the increasing resolution and availability of Remote Sensing (RS) images. Here we focus on region-growing segmentation, which is regarded as a time-consuming approach that is demanding in terms of computational resources. Its parallelization is a complex problem, since it usually changes the final outcome relative to what a sequential solution would deliver. This is because subdividing an image to segment its tiles concurrently usually introduces undesirable artifacts near the borders of the image tiles. Additional processing steps are then required to properly stitch together the segments along tile borders in order to eliminate such artifacts. In this work we evaluated alternative implementations of a previously proposed distributed region-growing segmentation approach, which was originally built on top of the Hadoop distributed computing framework. We developed a new implementation of the approach, built with the Spark framework, and compared its performance with that of the original implementation. In this investigation, RS images of various sizes were processed using different configurations of a physical computer cluster. We evaluated computational performance and assessed the differences among the segmentation outcomes generated by the alternative implementations. We also assessed the stability of the implementations by comparing the segmentations produced with different cluster configurations. Although the approach is, in principle, suitable for any region-growing algorithm, the experiments were performed with one particular segmentation method. The results showed that the Spark implementation consistently outperformed its Hadoop counterpart, in most cases bringing a significant improvement in processing time.
The experimental results also attested to the stability of the distributed segmentation approach, as very similar results were produced by the alternative implementations running on different cluster configurations.
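The tiling issue mentioned in the abstract can be illustrated with a minimal sketch. It is not the implementation evaluated in this work. It only assumes non-overlapping square tiles and a per-tile label grid produced by some region-growing segmenter, and the names `split_into_tiles` and `border_segments` are hypothetical. The sketch shows how one might identify the segments that touch an internal tile border, which are the candidates for a later stitching step.

```python
from typing import List, Set, Tuple

# (row_start, row_end, col_start, col_end), with exclusive end indices
Tile = Tuple[int, int, int, int]


def split_into_tiles(height: int, width: int, tile_size: int) -> List[Tile]:
    """Cover a height x width image with non-overlapping square tiles."""
    tiles = []
    for r0 in range(0, height, tile_size):
        for c0 in range(0, width, tile_size):
            tiles.append((r0, min(r0 + tile_size, height),
                          c0, min(c0 + tile_size, width)))
    return tiles


def border_segments(labels: List[List[int]], tile: Tile,
                    height: int, width: int) -> Set[int]:
    """Return the labels of segments that touch an internal tile border.

    `labels` is the label grid of one tile (assumed output of any
    region-growing segmenter). Tile edges that coincide with the image
    boundary are ignored, because no neighbouring tile exists there.
    """
    r0, r1, c0, c1 = tile
    touching = set()
    for i, row in enumerate(labels):
        for j, label in enumerate(row):
            r, c = r0 + i, c0 + j
            on_top = r == r0 and r0 > 0
            on_bottom = r == r1 - 1 and r1 < height
            on_left = c == c0 and c0 > 0
            on_right = c == c1 - 1 and c1 < width
            if on_top or on_bottom or on_left or on_right:
                touching.add(label)
    return touching


# Usage: a 4 x 6 image split into tiles of size 4 gives two tiles.
tiles = split_into_tiles(4, 6, 4)
# Hypothetical labels for the first tile (4 x 4 pixels).
first_tile_labels = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 2, 2],
    [3, 3, 3, 4],
]
to_stitch = border_segments(first_tile_labels, tiles[0], 4, 6)
```

In this toy case only the right edge of the first tile is internal, so only the segments with pixels in that column are flagged. Segments that lie entirely inside a tile can be kept as they are, which is why the extra stitching work concentrates on the tile borders.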