Online Geoprocessing using Multi-Dimensional Gridded Data

Traditional geoprocessing often relies on multiple software packages for data handling and management, which consumes almost 80% of the total time and requires the user to be well versed in the intricacies of pre-processing. There is therefore a need to reverse this balance between analysis and data management, so that scientists and researchers can focus on the science rather than on data handling and pre-processing. A Data Cube, a massive multi-dimensional array of raster or gridded data, ‘stacks’ satellite images, addresses the problems of traditional remote sensing practice, and provides an interactive environment in which datasets can be analysed with relative ease. The framework allows multi-format and multi-projection datasets spanning decades to be used in geoprocessing tasks ranging from simple GIS operations, such as data conversion and time series generation, to more complex ones such as change detection, NDVI generation, unsupervised classification and modelling. LISS III data for the state of Uttarakhand, India, were ingested, analysed and visualised with Python scripts in the interactive Jupyter Notebook interface. The Data Cube framework proved to be a flexible and extensible development environment that can be adapted to meet more complex modelling requirements.


INTRODUCTION
Native, offline geoprocessing has few guiding principles for handling large-scale, complex datasets that must be processed and analysed before users can access them (Hofer, 2015). Traditionally, remote sensing products go through many step-by-step procedures before being delivered to a client; this often takes up to 80% of the total effort, with less than 20% actually spent on analysis and development (Oliver & Woodcock, 2015). Along with data interoperability and exchange, data size limitations and processing capabilities often challenge researchers bound to a traditional, desktop approach to geoprocessing, which is limited not only by hardware but also by the many complexities the user must attend to in order to prepare, manage and use the data across various software packages. Given the vast amounts of Earth Observation (EO) data generated each day, much of it is likely to be unstructured and, more importantly, not to conform to international standards (Lewis et al., 2017). To overcome these issues and vastly improve the accessibility and scalability of EO data, a more robust and powerful framework is required, one that adheres to interoperability standards and allows on-the-fly geoprocessing of large amounts of EO data. The Data Cube is one such framework; it works on the principle of "stacking" satellite imagery in a multi-dimensional array of gridded data. In a study by Mueller et al. (2016), over 100,000 satellite images and their metadata were ortho-rectified, corrected to measurements of surface reflectance and analysed for observations of water at a resolution of 25 m. A project of this scale would not be possible with traditional remote sensing methods. The framework can be extended to fit use cases ranging from continental-scale analysis of vegetation change and species distribution modelling to understanding climatic change over long periods of time.

Online Geoprocessing technology and its attempts worldwide
One of the most prominent examples of an online, scalable and interoperable geoprocessing platform is Google Earth Engine (GEE), a cloud-based platform for planetary-scale analysis built on Google's computational infrastructure for petabyte-scale analysis of EO data (Gorelick et al., 2017). Housing massive data catalogues indexed by high-performance parallel computers, GEE allows users to quickly access and analyse data from a web browser. This allows a researcher to skip various hurdles of handling EO data, such as file formats, database management and geospatial data processing techniques. File-based data handling mechanisms such as the Hadoop Distributed File System (HDFS) and GeoTrellis process large spatial queries using distributed memory abstraction techniques, which enable collections of elements to be processed in parallel (Appel, Lahn, Buytaert, & Pebesma, 2018). An alternative is to represent EO data as multidimensional arrays and to use array databases not only for storage but also for analysis. The EarthServer project uses such technology in the domains of image and sensor statistics, neuroscience, OLAP and high-level computing. The RasDaMan engine manages potentially unlimited data volumes efficiently and adheres to OGC data and service standards for interoperability. This ensures that big EO data can be handled and analysed in a cost-efficient and scalable manner (Baumann, 1999; "EarthServer.eu," 2018).
This paper builds on the work of the Australian Geoscience Data Cube (AGDC) project, from which the multidimensional framework is adapted. AGDC has addressed various challenges of using large EO data, focusing in particular on the V's of Big Data (Lewis et al., 2017): Volume, Veracity, Velocity and Variety have been successfully addressed using the Data Cube framework, upon which multiple studies have been carried out. Projects such as Water Observations from Space (WOfS) (Mueller et al., 2016), Fractional Cover (Scarth, P., Röder, A., Schmidt, 2010), Normalised Difference Vegetation Index (NDVI), Intertidal Extents Model (ITEM) (Sagar, Roberts, Bala, & Lymburner, 2017) and Surface Reflectance (SR) were carried out using the Landsat archive of Australia. The Data Cube framework was built and coupled with the computing facility of the National Computational Infrastructure (NCI), where petabyte-scale EO data were orthocorrected, atmospherically corrected and analysed successfully.

STUDY AREA
The study area is the State of Uttarakhand, India, which is located at the foothills of the Himalayas and is often referred to as "Devbhumi" (Land of the Gods). The state covers roughly 54,000 km² with an elevation range of 600 to 7800 metres. It is an agriculture- and tourism-dependent state where the role of geospatial data is crucial. The terrain and climate are very diverse: regions near the Himalayas experience heavy snowfall, while the densely populated cities of the plains often experience heavy rainfall. The study area is shown in Figure 1.

Data Cube Package Installation:
The latest version of the Data Cube package can be sourced from Git and installed along with all of its dependencies. Installation of the Data Cube and its supporting dependencies should be carried out in a virtual environment, so as to avoid version and library conflicts with other packages on the test system.

Metadata Preparation:
A prerequisite for indexing and ingesting data is the metadata preparation phase. Satellite data is usually accompanied by a metadata file in plain text or XML format, but this format cannot be read by the Data Cube. Instead, a YAML-formatted text file needs to be generated from the data in order for the Data Cube to index a dataset. The generated dataset metadata consists of descriptive variables similar to those of the product definition. This configuration file is unique to each scene of the product, as the extents vary across tiles. There can therefore be tens if not hundreds of dataset documents required to map each scene accurately.
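As a rough illustration of what a per-scene dataset document might contain, the sketch below assembles one in Python and serialises it as YAML-style text. The field names, the helper functions build_dataset_doc and to_yaml, the platform placeholder, the bounding box and the file name are all assumptions made for illustration; they are loosely modelled on common Data Cube metadata layouts and are not the exact schema used in this study. To stay dependency-free the sketch uses a tiny hand-written emitter; in practice a YAML library would normally do this job.

```python
import uuid


def build_dataset_doc(scene_id, platform, instrument, acquired, bounds, band_paths):
    """Assemble a per-scene metadata document as a nested dict (illustrative fields)."""
    left, bottom, right, top = bounds
    return {
        # Deterministic identifier derived from the scene name
        "id": str(uuid.uuid5(uuid.NAMESPACE_URL, scene_id)),
        "platform": {"code": platform},
        "instrument": {"name": instrument},
        "extent": {
            "center_dt": acquired,
            "coord": {
                "ul": {"lon": left, "lat": top},
                "ur": {"lon": right, "lat": top},
                "ll": {"lon": left, "lat": bottom},
                "lr": {"lon": right, "lat": bottom},
            },
        },
        "image": {"bands": {name: {"path": path} for name, path in band_paths.items()}},
    }


def to_yaml(node, indent=0):
    """Serialise nested dicts of scalars as block-style YAML (minimal emitter)."""
    pad = " " * indent
    lines = []
    for key, value in node.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.append(to_yaml(value, indent + 2))
        elif isinstance(value, str):
            lines.append(f"{pad}{key}: {value!r}")  # quote strings
        else:
            lines.append(f"{pad}{key}: {value}")
    return "\n".join(lines)


# Hypothetical scene: placeholder platform, approximate extent, made-up file name
doc = build_dataset_doc(
    scene_id="scene-001",
    platform="EXAMPLE_PLATFORM",
    instrument="LISS_III",
    acquired="2010-01-15T05:30:00",
    bounds=(77.5, 28.7, 81.0, 31.5),
    band_paths={"red": "scene-001_red.tif", "nir": "scene-001_nir.tif"},
)
print(to_yaml(doc))
```

Because the extents differ for every scene, a function of this kind would be run once per scene to produce one document each.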

Database Indexing:
Indexing a dataset into a database is the process of recording an instance of the data and its corresponding metadata in temporary storage in the database. This is done purely to improve the speed of data access, especially when dealing with large datasets at gigabyte if not petabyte scales. It also ensures that changes made during data manipulation and analysis do not alter the original dataset stored in the system (Lewis et al., 2017).
The Entity Relationship Diagram (ERD) for the database indexing phase is shown in Figure 3.
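The idea that the index records metadata and file locations rather than the pixel data itself can be illustrated with a small stand-in. The sketch below uses Python's built-in sqlite3 module instead of PostgreSQL, and the two-table layout, the column names and the helper functions are simplifying assumptions rather than the actual Data Cube schema.

```python
import json
import sqlite3


def create_index(conn):
    """Create a toy index: one table for metadata, one for file locations."""
    conn.executescript(
        """
        CREATE TABLE dataset (
            id TEXT PRIMARY KEY,
            product TEXT NOT NULL,
            metadata TEXT NOT NULL
        );
        CREATE TABLE dataset_location (
            dataset_id TEXT NOT NULL REFERENCES dataset(id),
            uri TEXT NOT NULL
        );
        """
    )


def index_dataset(conn, dataset_id, product, metadata, uri):
    """Record a dataset's metadata and location; the raster file is never modified."""
    with conn:  # commit both rows together, or neither
        conn.execute(
            "INSERT INTO dataset VALUES (?, ?, ?)",
            (dataset_id, product, json.dumps(metadata)),
        )
        conn.execute("INSERT INTO dataset_location VALUES (?, ?)", (dataset_id, uri))


def find_locations(conn, product):
    """Look up where the files of a product live without opening any of them."""
    rows = conn.execute(
        "SELECT l.uri FROM dataset d "
        "JOIN dataset_location l ON l.dataset_id = d.id "
        "WHERE d.product = ? ORDER BY l.uri",
        (product,),
    )
    return [uri for (uri,) in rows]


conn = sqlite3.connect(":memory:")
create_index(conn)
# Hypothetical product name and file path
index_dataset(conn, "a1", "liss3_example", {"instrument": "LISS_III"}, "file:///data/scene_a1.tif")
print(find_locations(conn, "liss3_example"))
```

Queries against such an index touch only small metadata rows, which is why lookups stay fast even when the underlying archive is very large.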

Data Ingestion:
The process of inserting data into the Data Cube by mapping the dataset from its original form to a new storage schema is called ingestion. The process is governed by many variables, which dictate the metadata and storage format before the data are written to disk. These variables are described in the ingestion configuration file, written as YAML-formatted text.
The initial step ensures that the names of the files to be ingested match the files described in the ingestion configuration file. The configuration files are verified after root access is supplied, after which the indexed data are verified for each scene and product indexed into the Data Cube. The next step tests the grids and spatial reference and checks whether they match the reference set by the ingestion configuration. By default, the Data Cube API transforms all datasets to a standard spatial reference set in the initialisation phase; this ensures that no dataset is mismatched or spatially unrecognised. The compliance phase ends when the default storage unit is set and a NetCDF check is carried out on each file to maintain the common data format standard followed by the Data Cube API. NetCDF files are interoperable over the internet and can store an almost limitless number of dimensions and groups. The zlib data compression standard allows such complex data to be shared across the internet at low disk and bandwidth cost ("netCDF4 API documentation," 2018). Figure 4 describes the ingestion process in detail.
The ingestion phase begins with the reading of the configuration file; each attribute is read and verified to exist in the index before continuing. The NetCDF file metadata is then cross-checked against the attributes found in the configuration file and against the original (satellite order). When all the attributes, filenames and metadata are found to be correct, the Data Cube API sets the input types, extents, time slices and dimensions, along with the spatial reference, on the example NetCDF file (which is also indexed).
The final stage of ingestion is the reprojection of each scene/tile into scaled-down pixel values using the Data Cube API. This step concludes the ingestion of a scene and is repeated for all scenes and products indexed in the Data Cube.
The ingestion process ends when all the scenes have been reprojected as per the configuration and each NetCDF file has been written to disk.
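The verification step at the start of ingestion, in which each attribute of the configuration is checked against the index before anything is written, could look roughly like the sketch below. The configuration keys (source_type, measurements, src_varname, storage) follow the general shape of Data Cube ingestion files but should be read as assumptions, and verify_ingestion_config is a hypothetical helper, not part of the Data Cube API.

```python
def verify_ingestion_config(config, indexed_product):
    """Return a list of problems; an empty list means ingestion may proceed."""
    problems = []

    # The configuration must point at a product that has already been indexed
    if config.get("source_type") != indexed_product["name"]:
        problems.append("source_type does not match an indexed product")

    # Every measurement to be ingested must exist in the indexed product
    known = set(indexed_product["measurements"])
    for measurement in config.get("measurements", []):
        source = measurement.get("src_varname")
        if source not in known:
            problems.append(f"measurement {source!r} is not in the index")

    # The storage section must define the target grid
    storage = config.get("storage", {})
    for key in ("crs", "resolution", "tile_size"):
        if key not in storage:
            problems.append(f"storage.{key} is not set")

    return problems


# Hypothetical product and configuration, written as plain dicts
product = {"name": "liss3_example", "measurements": ["red", "nir"]}
config = {
    "source_type": "liss3_example",
    "measurements": [{"name": "red", "src_varname": "red"},
                     {"name": "swir", "src_varname": "swir"}],
    "storage": {"crs": "EPSG:32644", "resolution": {"x": 25, "y": -25}},
}
for problem in verify_ingestion_config(config, product):
    print(problem)
```

Collecting all problems before stopping, rather than failing on the first, makes it easier to correct a configuration file in one pass.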

Data Load:
Once ingestion is complete, loading data is the process of querying the database for the required dataset and its matching product and returning it in the XArray Dataset storage format. The flowchart in Figure 5 describes the data loading process for a query supplied by the user.
Figure 5. Data Load
Once a query is read and the polygon is calculated, all the datasets within the supplied time frame are grouped. An output array for all of the bands is created at every timestamp. The same process is repeated for each dataset encountered, after which the data portion is loaded from file and fused into the array. The fuse operation is carried out when overlapping regions exist in the polygons. Once all the datasets have been queried, read and fused, an output array is created to store all the datasets grouped by time and is returned to the user as an XArray Dataset. This can then be analysed and visualised using supported libraries such as Pandas and Matplotlib for various use cases and algorithms.
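The group-then-fuse logic of the load step can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: tiles are small lists of lists already aligned to the same output grid, a "first valid value wins" rule is used for overlaps, and the function names fuse and load_grouped are hypothetical rather than taken from the Data Cube API.

```python
from collections import defaultdict


def fuse(dest, src, nodata):
    """Fill nodata cells of dest with values from src, in place."""
    for r, row in enumerate(src):
        for c, value in enumerate(row):
            if dest[r][c] == nodata and value != nodata:
                dest[r][c] = value
    return dest


def load_grouped(datasets, shape, nodata):
    """Group (timestamp, tile) pairs by time and fuse each group into one array."""
    groups = defaultdict(list)
    for timestamp, tile in datasets:
        groups[timestamp].append(tile)

    rows, cols = shape
    result = {}
    for timestamp in sorted(groups):
        # One output array per time stamp, initially empty
        out = [[nodata] * cols for _ in range(rows)]
        for tile in groups[timestamp]:
            fuse(out, tile, nodata)
        result[timestamp] = out
    return result


NODATA = -999
tile_a = [[1, NODATA], [NODATA, NODATA]]
tile_b = [[9, 2], [NODATA, 4]]
loaded = load_grouped(
    [("2010-01-01", tile_a), ("2010-01-01", tile_b), ("2010-02-01", tile_b)],
    shape=(2, 2),
    nodata=NODATA,
)
print(loaded["2010-01-01"])
```

In the overlapping cell the value from the first tile is kept, and cells covered by no tile remain at the nodata value.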

RESULTS
This section describes the results obtained after implementing the Data Cube and running various analyses in the environment. These operations were carried out on the Jupyter Notebook, an interactive, browser-based platform for live code, equations and visualisations. The Data Cube can be accessed remotely via a web browser (Jupyter Notebook), and algorithms can be developed across multiple datasets without complications related to data interoperability or compatibility.
The results achieved during the course of this project are described below:

Ingestion of Satellite Data
As described in section 4.2.7, various scripts were written to customise different satellite data according to user requirements and definitions. Linear Imaging Self Scanning Sensor (LISS) III data were acquired for the study area from 2000 to 2015, consuming over 120 gigabytes and generating over 8,000 NetCDF files used by the Data Cube index operations. The scenes were indexed and ingested successfully on the hardware described in section 4.1.1.

Accessing the Data Cube
A configuration file holds the key information required to connect to the Data Cube, such as the database name, the hostname or IP address of the database, and the username and password specific to each user. This ensures that only authorised users with the right credentials are allowed to query results from the Data Cube. The datacube library is imported and the configuration file is accessed from within the Jupyter Notebook.
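A minimal sketch of such a configuration file and of reading it is given below. The section and key names follow the INI-style layout commonly used for Data Cube configuration, but the values, the file path and the helper read_connection_settings are placeholders chosen for illustration. The final comment shows how the file would typically be handed to the datacube library; it is left unexecuted because it needs the package and a running database.

```python
import configparser

# Placeholder credentials; a real file would hold the values for each user
EXAMPLE_CONFIG = """
[datacube]
db_database: datacube
db_hostname: 127.0.0.1
db_username: example_user
db_password: example_password
"""


def read_connection_settings(text, section="datacube"):
    """Parse INI-style configuration text and return one section as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser[section])


settings = read_connection_settings(EXAMPLE_CONFIG)
print(settings["db_hostname"])

# With the datacube package installed and the database reachable, the same
# file would typically be passed to the library (assumed path, not run here):
#   import datacube
#   dc = datacube.Datacube(config="/path/to/datacube.conf")
```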

Viewing Products and Measurements
Each satellite dataset comes with descriptive metadata covering the imagery extents, format, platform, instrument and so on, which the user can view in the Jupyter Notebook. This information is accessed through the Data Cube API, which reads the metadata of each tile/scene and summarises band measurements and product descriptions.

Retrieving Data
The Data Cube API is called to load specific ingested product types, along with the area of interest (AOI) described by its extents.
Along with the AOI, the resolution of the required tiles can also be specified. If resampling to a lower resolution is required, the resampling algorithm and the target resolution in metres can be specified in the code itself.
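The parameters of such a request can be collected as shown in the sketch below. The helper build_load_query, the product name, the AOI coordinates, the output CRS and the 100 m target resolution are all illustrative assumptions; the keyword names mirror those accepted by the load call of the Data Cube API, which is shown only in a comment because it requires a live connection.

```python
def build_load_query(product, lon_range, lat_range, time_range,
                     output_crs=None, resolution_m=None, resampling=None):
    """Collect load parameters into keyword arguments for a Data Cube load call."""
    if (output_crs is None) != (resolution_m is None):
        raise ValueError("output_crs and resolution_m must be given together")

    query = {"product": product, "x": lon_range, "y": lat_range, "time": time_range}
    if output_crs is not None:
        query["output_crs"] = output_crs
        # Resolution is given as (y, x); y is negative because rows run north to south
        query["resolution"] = (-resolution_m, resolution_m)
        if resampling is not None:
            query["resampling"] = resampling
    return query


query = build_load_query(
    product="liss3_example",                  # hypothetical product name
    lon_range=(78.0, 78.5),                   # illustrative AOI
    lat_range=(30.0, 30.5),
    time_range=("2010-01-01", "2010-12-31"),
    output_crs="EPSG:32644",                  # assumed projected CRS in metres
    resolution_m=100,                         # coarser than the native pixel size
    resampling="average",
)
print(query)

# With a live connection `dc`, the query would be passed on as:
#   data = dc.load(**query)
```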

NDVI Generation
The Normalised Difference Vegetation Index (NDVI) was calculated from the Red and NIR bands of LISS III imagery; the Data Cube API was used to call the product and the corresponding tiles within the AOI. After NO_DATA values were masked, NDVI was calculated for each tile. The tiles were then mosaicked by taking the mean NDVI.
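NDVI is defined per pixel as (NIR - Red) / (NIR + Red). The sketch below walks through the three steps described above (mask, compute, mean mosaic) on tiny hand-made tiles. It is written in plain Python for clarity and assumes the tiles are already aligned to the same grid; on the Data Cube the same arithmetic would normally be applied to whole arrays at once. The function names and pixel values are illustrative.

```python
import math


def ndvi(red, nir, nodata):
    """Per-pixel NDVI for one tile; masked or undefined pixels become NaN."""
    out = []
    for red_row, nir_row in zip(red, nir):
        row = []
        for r, n in zip(red_row, nir_row):
            if r == nodata or n == nodata or (n + r) == 0:
                row.append(math.nan)
            else:
                row.append((n - r) / (n + r))
        out.append(row)
    return out


def mean_mosaic(tiles):
    """Cell-wise mean over co-registered tiles, ignoring NaN cells."""
    rows, cols = len(tiles[0]), len(tiles[0][0])
    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            valid = [t[i][j] for t in tiles if not math.isnan(t[i][j])]
            row.append(sum(valid) / len(valid) if valid else math.nan)
        out.append(row)
    return out


NODATA = -999
tile_1 = ndvi(red=[[50, NODATA]], nir=[[150, 100]], nodata=NODATA)
tile_2 = ndvi(red=[[100, 20]], nir=[[100, 60]], nodata=NODATA)
mosaic = mean_mosaic([tile_1, tile_2])
print(mosaic)
```

Masking before the division keeps NO_DATA values from producing spurious index values, and ignoring NaN in the mean lets a pixel covered by only one valid tile keep that tile's value.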

CONCLUSION
This study has shown the value of a scalable and robust framework that can be adapted to multiple datasets without having to deal with data dependency, file format and projection issues. The framework paves the way to a more efficient use of large remotely sensed datasets, which are growing immensely. Technology has advanced to such a stage that data generation from EO satellites far outpaces the rate of data utilisation. As new frontiers in Machine Learning, Deep Learning, Computer Vision and Human Computer Interaction are crossed, there is a need for data to be 'Analysis Ready'. This is imperative, as data from the Sentinel and GISAT series will further push our data handling and processing capabilities, which, if not addressed properly, would affect the quality of research and, more importantly, its impact on society. A multi-dimensional framework of gridded data can address not only usability and efficiency but also the challenges posed by Big Data, in contrast to the slow and tiresome process that traditional remote sensing has followed for decades.

Figure 1. Study Area

4. MATERIALS AND METHODS
4.1 Materials Required
4.1.1 Hardware Requirements:
The Data Cube framework was set up on a workstation PC with a 6th-generation Intel i7 processor and 16 gigabytes of RAM.

Figure 3. Database Indexing
The class PostgresDB contains the Index class, which stores the references to all users, datasets, metadata_types, products and URIs, and defines functions to initialise the database, to close the database connection and to enter and exit the initialisation modules. Part of the PostgresDB class is SQLAlchemy, a Python database toolkit. This module is an object-relational mapper that offers the benefits and flexibility of SQL with high-performance access and data abstraction. SQLAlchemy is the crux of the Engine class, which establishes the database connection and enables the execution of the XArray Dataset methods used to carry out analysis. All changes to these database objects are reflected only in the Index class and not in the stored data.

Figure 4. Data Ingestion

Figures 6 and 7 show single-band plots of the LISS III band 3 and band 4 imagery retrieved from the Data Cube. All the scenes pertaining to the study area were mosaicked using Python and its dependencies, without relying on consumer-grade software.

Figure 9. NDVI of year 2010
Figure 9 shows the NDVI generated for the year 2010 on the Data Cube framework. The NDVI tiles were mosaicked by taking the mean NDVI of all the tiles pertaining to the study area. This operation was carried out in under 3 minutes, at 70% CPU and 65% memory utilisation. Such tasks are hardware dependent: the better the hardware configuration, the faster the processing. Similarly, Figure 10 shows the NDVI values reclassified into 4 distinct classes (-1 to 0 for class 1, 0 to 0.1 for class 2, 0.1 to 0.25 for class 3 and above 0.25 for class 4) to better visualise areas of low and high/dense vegetation.
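The four-class reclassification can be expressed as a lookup against the class boundaries. In the sketch below the treatment of values that fall exactly on a boundary (assigned to the lower class) and the use of class 0 for masked pixels are assumptions, since the text does not specify them; the function name reclassify is likewise illustrative.

```python
import math
from bisect import bisect_left

# Upper bounds of classes 1 to 3; anything above the last bound is class 4
UPPER_BOUNDS = (0.0, 0.1, 0.25)


def reclassify(value):
    """Map an NDVI value to class 1-4, or 0 for a masked (NaN) pixel."""
    if math.isnan(value):
        return 0  # assumed code for masked pixels
    return bisect_left(UPPER_BOUNDS, value) + 1


ndvi_row = [-0.4, 0.05, 0.2, 0.6, math.nan]
classes = [reclassify(v) for v in ndvi_row]
print(classes)
```

Keeping the boundaries in one tuple makes it easy to experiment with different thresholds without touching the classification logic.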

Figure 10. Reclassified NDVI of 2010

5.8 Band Statistics
Various operations such as False Colour Composite (FCC) generation and mean, median and histogram generation were carried out to test the efficacy of the Data Cube framework in handling the most common GIS tasks usually carried out in traditional software. Once data was ingested into the Data Cube, the user

Figure 16. 2010-2005 Change Detection

    import matplotlib.pyplot as plt
    from matplotlib import ticker

    # ndvi_diff: 2-D array of NDVI differences between 2010 and 2005, computed beforehand
    # Draw contour lines and filled contours of the NDVI difference
    plt.contour(ndvi_diff, linestyles="solid", origin='upper')
    plt.contourf(ndvi_diff, origin='upper')

    # Reduce the number of colour bar ticks and label them by change class
    cbar = plt.colorbar()
    tick_locator = ticker.MaxNLocator(nbins=3)
    cbar.locator = tick_locator
    cbar.update_ticks()
    cbar.ax.set_yticklabels(['Negative Change', 'No Change', 'Positive Change'])
    plt.rc('font', size=8)
    plt.draw()
    plt.show()

A PostgreSQL database is initialised with superuser permission and a schema is generated to hold all the table values. The agdc schema consists of 5 tables, namely dataset, dataset_location, dataset_source, dataset_type and metadata_type, as well as a login table to maintain user records.

4.2.3 Product Definition:
The Data Cube can handle many different types of data, and for this reason it is essential that the Data Cube understands the differences and nuances of each dataset and what to do with them. The product definition describes numerous variables, similar to the ingestion configuration, but is unique to each satellite data product.