HIGH PERFORMANCE COMPUTING FOR DSM EXTRACTION FROM ZY-3 TRI-STEREO IMAGERY

ZY-3 has been acquiring high quality imagery since its launch in 2012 and its tri-stereo (three-view or three-line-array) imagery has become one of the top choices for extracting DSM (Digital Surface Model) products in China over the past few years. The ZY-3 tristereo sensors offer users the ability to capture imagery over large regions including an entire territory of a country, such as China, resulting in a large volume of ZY-3 tri-stereo scenes which require timely (e.g., near real time) processing, something that is not currently possible using traditional photogrammetry workstations. This paper presents a high performance computing solution which can efficiently and automatically extract DSM products from ZY-3 tri-stereo imagery. The high performance computing solution leverages certain parallel computing technologies to accelerate computation within an individual scene and then deploys a distributed computing technology to increase the overall data throughput in a robust and efficient manner. By taking advantage of the inherent efficiencies within the high performance computing environment, the DSM extraction process can exploit all combinations offered from a set of tri-stereo images (forward-backword, forward-nadir and backword-nadir). The DSM results merged from all of the potential combinations can minimize blunders (e.g., incorrect matches) and also offer the ability to remove potential occlusions which may exist in a single stereo pair, resulting in improved accuracy and quality versus those that are not merged. Accelerated performance is inherent within each of the individual steps of the DSM extraction workflow, including the collection of ground control points and tie points, image bundle adjustment, the creation of epipolar images, and computing elevations. Preliminary experiments over a large area in China have proven that the high performance computing system can generate high quality and accurate DSM products in a rapid manner.

One of the predominant features that makes ZY-3 so popular is that it offers the potential to extract high quality DSM/DEM (Digital Elevation Model) products from its tri-stereo (also called three-line or three-view) imagery (Li, 2012).A national-scale surveying project in China (SBSM, 2015) (Baike, 2015) acquired and demonstrated the use of ZY-3 tri-stereo imagery.Researchers have reported on extracting DSM from ZY-3 images (Zhang, et al., 2012) (Xue, et al., 2015) (Guo, et al., 2015).There is an increasing demand for national-scale surveying projects using ZY-3 tri-stereo imagery.In order to provide full coverage of the entire territory of China using ZY-3 imagery, more than 30,000 scenes must be acquired and processed.Processing this large volume of scenes in a timely manner within a project is very challenging.Users are increasingly seeking to lower their processing costs, while sensors continue to be more complex.They still want to achieve high quality DSM/DEM products (i.e., higher resolution, and less noise and abnormal values, and remain competitive with the free 30m ASTER DEM (ASTER, 2015) and 90m SRTM DEM (USGS, 2015b)).A system which offers high accuracy and high volume output while maintaining high quality standards entails a significant achievement of high efficiency in production processes.Traditional photogrammetric workstation software systems have been severely challenged, due to their sequential computing software architecture which cannot take advantage of the latest computer hardware architecture (multiple or many cores system) to gain computing power.Traditional photogrammetric workstations would require significant and time consuming upgrades to compete with the high performance computing system.Hence, solutions were sought from the high performance computing technologies, among which parallel computing and distributed computing are outstanding.The choice of parallel computing is rooted from the parallel computation nature in most image processing algorithms (Stevenson, et al., 2000) (Merigot & Petrosino, 2008).Distributed computing from another perspective could simultaneously execute computation from different tasks and hence increase the overall data throughput in a limited time (WikiPedia, 2015e) (Deng, et al., 2014a).
Since a set of tri-stereo imagery has advantages over the traditional dual-stereo imagery (i.e., stereo pairs): a wider viewing angle (forward-backward), and two regular viewing angles (forward, backward when combined with the nadir viewing angle) in order to reduce occlusion (Pollard, et al., 2010) (Krishma, et al., 2008), the high performance system is ideal to gain efficiencies when processing all potential combinations which require less computing resources.
The remainder of the paper is organized as follows.Section 2 reviews the technologies for high performance computing in the remote sensing domain.Section 3 then renders the ideas of architecting a system that could deploy both parallel computing and distributed computing to the DSM extraction process.Section 4 reports preliminary experiments over a large region in China in order to demonstrate the computing performance and accurate results derived from the system.

Parallel Computing
The performance of parallel computing is generally associated with two issues: one is the software implementation and the other is the hardware configuration.
The issue of software implementation examines how a traditional sequential algorithm is converted to a parallel one.Early research (Pachowics, 1989) examines how to convert sequential image processing algorithms into parallel ones for a simple sequential minicomputer.The first approach proposed in this research was to pack more than one image pixel (particularly for binary and grey level images) into a single type of data to reduce computation.The research then categorized the image operations into a low (e.g., logical, arithmetic and shift) level and a high level type (e.g., choice of neighbouring pixels, conditional operations, counting operation, etc.) to take advantage of the packing of pixels in parallelization.A lot of research was done using various strategies to parallelize the sequential algorithms (Stevenson, et al., 2000) (Jones, et al., 2003) (Akoguz, et al., 2013) (Bhojne, et al., 2013) (Plaza & Chang, 2007).Those various implementations require image processing domain knowledge from scientists and computer knowledge from computer engineers, which creates a gap between image processing and high performance (Seinstra, et al., 2002).The researchers (Seinstra, et al., 2002) initialized a software architecture with an extensive library of data parallel low level image operations (assuming the computing environment to be homogeneous distributed multi-computers).Although such initiatives could remove the cumbersome implementation work from image processing researchers, the efficiency of using the parallel libraries yet remains in hands of the users who are required to have comprehensive knowledge of the library and associated domain knowledge (Seinstra, et al., 2002).Similar efforts are reported in a few practices that targeted making parallel image processing algorithms as open sources (OpenCV, 2015) (OpenVIDIA, 2015).OpenCL (Open Computing Language) is a project in computer science to provide a generic software tool to implement parallel algorithms in various areas including image processing (Khronos, 2015).An advantage of OpenCL is its abstraction to various hardware platforms.An OpenCL-supported programming tool compiles a program with parallel instructions into a parallel program against a particular platform for the distribution of executables.
The issue of hardware configuration concerns what computing architecture for parallelization is implemented and how fast the hardware can process floating-point data, e.g., every second (usually in a unit of GFLOPS/TFLOPS/PFLOPS -Giga/Tera/Peta Floating-point Operations Per Second).
There are two computing architectures for parallelization: the SIMD (Single Instruction, Multiple Date streams) and MIMD (Multiple Instructions, Multiple Data streams) architectures according to Flynn's taxonomy (Flynn, 1972).In the SIMD architecture, many computing units simultaneously execute the same instruction with their own data which can be different from each other.For example, the supercomputer with vector processors decades ago supported SIMD, and the recent popular General Purpose GPU (GPGPU, which is still referred to as GPU hereafter in this paper) cards achieve impressive performance by implementing the SIMD instructions (Tsuchiyama, et al., 2012).
The MIMD architecture executes different instructions with different data simultaneously.The present multiple-core CPU system is an implementation of MIMD.Note that a memory type for data accessing when executing either SIMD or MIMD instructions is critical for achieving the best performance in parallel computing and the choice of a memory type (e.g., distributed or shared) leads to different hardware products in a computer system (Tsuchiyama, et al., 2012).
The speed for floating-point data operations is considered as an indicator of how well the hardware can support parallelization.
When increasing the clock cycles in a single core slowed down in the early part of the 21st century, development of multiple cores in a single CPU takes the majority of the responsibility to increase computing speed (WikiPedia, 2015d).A commercially available four-core CPU can achieve about 40 GFLOPS (more or less subject to manufacturing technology) in comparison with a single core CPU which has about 10 GFLOPS (WikiPedia, 2015b).Yet, the pace for increasing the number of cores in a single CPU is slow (Lee, et al., 2011) (WikiPedia, 2015d).The GPU technology has proven that it can handle the same tasks at a much faster pace to offer significantly more cores in comparison with CPU (Owens, et al., 2007).For example, a lowend commercial GPU from NVIDIA's GeForce GTX 660 offers 960 cores and is able to achieve a peak of 1,800 GFLOPS (single precision), while a high-end GPU from NVDIA's K10 offers about 3,000 cores and can reach up to 5,000 GFLOPS (single precision) (WikiPedia, 2015c).The GPU demonstrates an impressive advantage in boosting performance for parallel computing.The GPU always requires a CPU for data communication from storage to memory (either from the device or from the host) and hence results in a data feed bottleneck in high performance computing.APU (Advanced Processing Unit) technology has recently emerged in order to overcome the communication issue between host and device (AMD, 2015).Sharing the same fast data bus between CPU and GPU, performance of a parallel computing algorithm should be improved (Daga, et al., 2011).Although the APU technology is still in its infancy and the mainstream of CPU/GPU manufacturers have not taken this path, it can be predicted that this would have great potential to speed up performance in parallel computing and its applications in the remote sensing domain.Cell Broadband Engine (Cell) is another hardware solution to combine powers from both CPU and GPU-like units, which has seen successes in gaming applications etc., but it poses challenges to developers due to the architecture of its memory usage (WikiPedia, 2015a).
FPGA (Field-programming Gate Array) is one solution that combines the software implementation with a configurable hardware.It has a pool of parallel image processing algorithms and those algorithms can be chosen by switching gates in the hardware device.This configurable hardware device enables users to select the desired algorithms for a particular application and hence achieves a high performance of computing.It has advantages due to its compact size so that it can be embedded onboard a sensor with a flying platform (El-Araby, et al., 2009), while it has restrictions due to the limitation of functions available on one FPGA unit regardless of the reconfigurable nature of those functions (Lee, et al., 2011).Experimental tests (Asano, et al., 2009) were done to compare the performance of CPU, GPU and FPGA and showed that the performance of GPU can match that of the FPGA even without careful tuning of the GPU programs.

Distributed Computing
Distributed computing is achieved through an infrastructure that combines computing power from a large number of processing nodes, which are often geographically located in different places.
The distributed computing infrastructure is a high level system that acts as a virtual organization from different entities (Hager & Wellein, 2010) (Lee, et al., 2011).The history of the development and application of distributed computing systems can be traced back to more than a half century ago (Lee, et al., 2011) (Sun, et al., 2010).There are a number of distributed computing types available for research and industrial applications such as grid computing, cloud computing, and volunteer computing etc. (Hager & Wellein, 2010).
The essence in distributed computing for high performance is its scalability that either aggregates a huge number of potential computing power units from an entire region, a whole country and even the world or increases computing power by adding new processing units locally.The applications of distributed computing in science and industries are tremendous.The paper (Lee, et al., 2011) provides a thorough survey of available distributed computing systems for remote sensing applications, including Matsu, GENESI-DR and GENESI-DEC, G-POD, GMSEC, GEO Grid, and GEOSS, etc.Some other applications of distributed computing in remote sensing are discussed in the book (Plaza & Chang, 2007).

Workflow Design
A general workflow for extracting DSM is described in the diagram below (Figure 1).
The key functions in the workflow are: Importing.It performs data format conversion from a commercial data format to an internal one, and then builds relevant metadata and initiates a geometric math model.Each scene can be imported independently.A tri-stereo scene consists of three images and each image may be imported without dependency.
Collecting GCPs (Ground Control Points).The core algorithm is about matching a target image to a reference image, which leads to extracting accurate geolocation information from the reference image for a number of points (i.e., GCPs) in the target image.
Those GCPs are then used to refine the initial math model via removing geometric dislocation with respect to the ground and distortions between pixels.
Collecting TPs (Tie Points).All scenes are divided into subgroups according to their spatial relationship, provided that a minimum spatial overlap be kept between two neighbour subgroups.TPs are then collected between two images for all spatially overlapped image pairs in a subgroup.Bundle adjusting.After TPs are collected, all images are consolidated into a single group and the math model of each image is adjusted for better accuracy and with evenly distributed errors.
Extracting DSM.DSM is then extracted from each stereo pair after executing the following functions: creating epipolar images, searching common points, computing parallax and then elevation, post-processing and interpolation.
Merging.Three DSMs generated from three combinations of one scene are merged into a single DSM according to their image matching scores, in which the highest score means the highest reliability of the computed elevation.

Distributed Computing
All functions except the bundle adjustment in the above workflow can be executed against an individual image or a pair of images.This means that each image or a pair of images can be distributed to any computing node and the processing of that image or pair is independent of the others.This phenomenon fits very well into a typical scenario that can be resolved by a system with distributed computing capability.A system with distributed computing capability allows executing multiple processes simultaneously not only in a single computing node (should it have the capability of running multiple processes) but also in a group of computing nodes.For its simplicity in implementation and advantage of having redundancy naturally, the computer clustering technology (WikiPedia, 2015e) (Deng, et al., 2014a) is an optimal choice for implementing the distributed computing capability.

Automation
Human interference should be avoided or should be reduced to the minimum if the full capability of a distributed computing system can be employed.Therefore, automation of the above functions is critical.The bottleneck often occurs in the process of collecting GCPs because trained technicians are usually required to finding GCPs from reference data.Fortunately, image matching technology has matured (Zitova & Flusser, 2003) (Wyawahare, et al., 2009).As long as there is sufficient similarity between two images, it is possible to choose one or certain specific algorithms to match one image against another automatically and hence GCPs can then be collected.With orthorectified imagery having become pretty common nowadays, a proper image matching algorithm should be able to automate GCPs collection from a reference image.The scenario is much simpler in automating TP collection because the images to be matched in this scenario are generally similar (usually coming from the same satellite and acquired almost at the same time).All the rest of functions (bundle adjustment, DSM extraction and merging) do not require any human intervention.

Parallelization of GCP and TP Collection Algorithms
There are many image matching algorithms that have been developed (Zitova & Flusser, 2003) (Wyawahare, et al., 2009) and most of them require the following two critical steps: 1. Extract features from both reference and target images; 2.

Matching features. The computation of extracting features in
Step 1 in most algorithms is generally localized in a small neighbourhood and hence the feature extraction algorithm can be made highly parallel.The most time-consuming work when matching features is to compute the similarity between the feature from the reference image and the candidate features (many) from the target image.After a similarity measure (e.g., the minimum Euclidean distance) is defined, the computation of similarity among pair-wise features can be parallelized.Because a GCP/TP searching process involves numerous operations of feature extraction and matching, the performance gain in those computations is tremendous when parallelized.

Parallelization of DSM Extraction Algorithms
After images are properly georeferenced and aligned with each other, the main steps for extracting DSM from a pair of stereo images are first to build epipolar images, then to match common points between a pair of epipolar images, to compute parallax and elevation, to interpolate elevations in a raster grid from the points that have elevation extracted, and to filter the rasterized elevation values as the last step.The computation in building epipolar images is to align two images line by line and the alignment of one line is independent of another, which can hence be parallelized.Both matching points along an epipolar line and the computation of parallax and elevation happen in a small neighbourhood and can thus be executed simultaneously as well.
Interpolation from known points depends on the selected points that are most likely near to those points to be interpolated, so the computation for elevation interpolation in a raster grid can be parallelized.Filtering (linearly) is about the convolution of an image with a small size kernel that is a typical local operator like most image processing algorithms, which can certainly be carried out in parallel.Parallelization of those computations in the above steps may result in a great gain in performance.

Hardware Configuration for High Performance Computing
There are a few practices reported (Zhang & Huang, 2011) (Philip, et al., 2011) (Dai & Yang, 2011) (Fang, et al., 2014) to use GPU plus CPU to maximize the parallelization performance for a couple of reasons.The first reason is to use CPU to control multiple GPU devices if available.The second reason is to use CPU to control conditional flows and execute the less efficient parallelization code or the portion of parallelization that requires large I/O, while GPU takes the tasks with highly efficient parallelization code and less demanding of I/O.
The performance of storage (i.e., hard drive) and network may dampen maximizing the performance of parallelization generated by GPU plus CPU.This is particularly true for the processes that require intensive reading and writing of image from/to storage.A few strategies can be used to mitigate restrictions from hardware and network: 1. Use a high performance hard drive as a local disk for a processing node so that the data swapping speed between memory and hard drive does not cause too much latency for parallelization; 2. Use a high speed network so that the data transferring speed over network will have minimum latency for parallelization; 3. Minimize the frequency of data transferring over network by first preparing data in a local disk, processing the data in the local disk, generate intermediate data in the local disk, and archive the results to the central storage at the last stage.

Implementation
The above solution is implemented in a high performance processing system.It uses a clustering technology to manage computing resources (nodes) and distribute processes to the desired computing nodes.Four key modules are implemented: Data Ingestion, GCP collection, Bundle Adjustment, and DSM Extraction.The Data Ingestion process takes all scenes and distributes them to all available nodes so that each image can be imported simultaneously.The GCP Collection process also distributes all ingested images to all available nodes and collects GCPs for each image against a reference image independently and simultaneously.The Bundle Adjustment process divides all images into a few groups and then distributes each group to the available processing nodes for parallel TP collection, followed by consolidating all collected TPs and adjusting math models accordingly.The DSM extraction process handles each scene (three images) simultaneously by building epipolar pairs, matching common points and computing elevation, and then fusing individual DSMs into a final DSM.
The experiments below were conducted using the above implemented system to demonstrate the processing efficiency and capability of producing quality DSM products.

Test Data and Computing Environment
The study region for performance comparison covers an area of about 25,600 km² (i.e., 160km x 160km) in the centre of China.
A total of 17 ZY-3 tri-stereo scenes or 51 images were selected according to the following criteria: minimum or free clouds and haze, acquisition dates between spring and fall, which was chosen to minimize the impact from lands/human build-ups covered by snow/ice.The reference image used for collecting GCPs is from Landsat 7 archives (before 2000) (USGS, 2015a).
A different set of ZY-3 scenes was used for DSM accuracy assessment due to limitation of accessing high-resolution reference image and DEM data.This data set consists of three ZY-3 scenes, and each scene has landscapes of flat (at sea level of 80 meters), mountain (at sea level of 1,200 meters) and hill (or mixture of flat and mountain, at sea level of 800 meters) respectively.The reference image used for collecting GCPs has 2 meters planimetric accuracy at 2 meters resolution; the reference DEM used for extracting elevations for collected GCPs is from SRTM at 90 meters resolution; the reference DEM used for accuracy comparison has 3 meters vertical accuracy at 5 meters horizontal resolution.
The computing environment for performance comparison used a desktop computer configured with one Intel Core i7 (3 rd generation) processor (3.4GHz), 32GB memory, a commercial GPU card (NVIDIA GeForce GTX 660 with 1GB device memory), and three hard drive disks (7,200 RPM).The i7 CPU processor have four physical cores only but can be hyperthreaded into eight logical cores.Multiple processes (up to eight in general) from the above implemented system are allowed to be executed simultaneously in this single desktop environment.Particularly, eight simultaneous importing processes are allowed for its low cost in CPU but high in I/O; six simultaneous GCP/TP collection processes are allowed considering its intensive usage of both CPU and I/O; five DSM extraction processes are allowed when extracting DSMs for achieving a somewhat balanced usage between CPU and I/O.

Tests and Results
Computing performance is compared between a manual workstation system and the above high performance system.In the manual workstation system, each scene was processed (e.g., GCPs are collected manually) sequentially and hence the total processing time was calculated from the time for the single scene multiplied by the number of images (or pairs); in the high performance system, the time was the actual time elapsing for processing all scenes since multiple scenes can be processed simultaneously.(below) summarizes the timings measuring the performance from both the high performance system and the traditional one when generating a 30m DSM using all 51 images for the entire study region.The time required in the high performance system is more than 10 times faster than that required in the traditional one.Since the majority of this area is mountainous and human buildups are rare, the generated DSM is assumed as an approximate DEM for that area (without extending the topic of this paper to discuss the conversion of DSM to DEM).It is noted that the quality of the merged DSM is also better.Below figures show the quality comparisons between the merged one against those results from all individual pairs.It can be seen in Figure 3 that the merged one picks the best part (see Figure 4) from the forward-backward result (which is usually inferior to the results from the other two pairs in most areas) but also improves it slightly (smoother).The results from the backward-nadir and forward-nadir pairs have much more noise (which could be associated with blunders when matching) than the forwardbackward result (see Figure 5 and Figure 6).

CONCLUSION AND DISCUSSION
The proposed solution and its corresponding implementation take advantage of high performance computing from parallel and distributed computing.The parallel computing accelerates the speed of processing individual scenes by leveraging the parallel computing power from multiple/many cores hardware.The distributed computing allows processing of multiple scenes simultaneously and hence achieves performance gain almost proportional to the number of processes (and processing nodes).
The preliminary experiments conducted in this paper show that the performance of the high performance system is increased tenfold over traditional manual workstation-style systems.
With computing performance gained through the high performance system, more combinations of the ZY-3 tri-stereo images can be processed and their results may be fused in order to generate more accurate and higher quality DSM products versus those from a single stereo pair.The initial experiments conducted in this paper are confirmation of our research hypothesis, and more research is required.It is expected to operate on a much larger data set, covering at least an area of multiple provincial territories within China, with data provided by the data provider after the DSM extraction function in the system (deployed in a cluster of multiple processing nodes) is fully operational.Future experiments are currently planned and they will provide more information to confirm the performance gain and the quality and accuracy of DSM results reported in this paper.

Figure 1 .
Figure 1.Design of DSM extraction workflow (in which F stands for forward, B for backward, and N for nadir)

Figure 2 .
Figure 2. Merged DSM of the entire testing region Figure 2 depicts the DSM result covering the entire testing region, rendered in a coloured shade relief fashion.Table1(below) summarizes the timings measuring the performance from both the high performance system and the traditional one when generating a 30m DSM using all 51 images for the entire study region.The time required in the high performance system is more than 10 times faster than that required in the traditional one.

Figure 3 .
Figure 3.A sample from a merged DSM

Figure 5 .
Figure 5.A sample from the backward-nadir DSM

Table 1 .
Performance comparison between the high performance system and a traditional one

Table 2 .
RMS of accuracy (elevation in meters) when comparing the results from each stereo pair with a reference DEM