CBERS DATA CUBE: A POWERFUL TECHNOLOGY FOR MAPPING AND MONITORING BRAZILIAN BIOMES

Currently, the overwhelming amount of Earth Observation data demands new solutions for processing and storage. To reduce the time spent searching, downloading and pre-processing data, the remote sensing community is coming to an agreement on the minimum set of corrections satellite images must undergo in order to reach the broadest range of applications. Satellite imagery meeting such criteria (which usually include atmospheric, radiometric and topographic corrections) is generically called Analysis Ready Data (ARD). Furthermore, ARD is being assembled into multidimensional data cubes, minimising preprocessing tasks and allowing scientists and users in general to focus on analysis. A particular instance of this is the Brazil Data Cube (BDC) project, which is processing remote sensing images of medium spatial resolution into ARD datasets and assembling them as multidimensional cubes of the Brazilian territory. For example, BDC users are released from performing tasks such as image co-registration and aerosol interference correction. This work presents a BDC proof of concept by analysing a BDC data cube made with images from the fourth China-Brazil Earth Resources Satellite (CBERS-4) of one of the largest biodiversity hotspots in the world, the Cerrado biome. It also shows how to map and monitor land use and land cover using the CBERS data cube. We demonstrate that the CBERS data cube is effective in resolving land use and land cover issues to meet local and national needs related to landscape dynamics, including deforestation, carbon emissions, and public policies.


INTRODUCTION
Currently, new satellite and terrestrial remote sensing systems produce such large amounts of data at fine temporal and radiometric resolutions (Nativi et al., 2015) that new data storage solutions are sorely needed. Besides, the time required to pre-process such amounts of data reduces the time actually invested in analysis. Analysis Ready Data (ARD) and multidimensional data cubes are preconditions to fulfil new analysis demands and to increase the levels of detail and accuracy required in a fast changing environment (Nativi et al., 2017).
The purpose of Earth observation (EO) data cubes is to organise the data to make their use so simple and intuitive that users can focus on developing and testing their methods (Appel, Pebesma, 2019). An EO data cube is a four-dimensional array whose dimensions are longitude, latitude, time, and spectral bands (Appel, Pebesma, 2019). Around the world, EO data cube initiatives are providing open data for the common good of society (Killough, 2018).
The first EO data cube was the Australian Geoscience Data Cube (AGDC), which still produces and distributes Analysis Ready Data made from Landsat images, enabling users to explore and increase the impact of EO data (Committee on Earth Observation Sciences (CEOS), n.d.). The AGDC runs on top of the Open Data Cube (ODC) infrastructure, which has been used to create other national and regional data cubes such as those of Switzerland (Giuliani et al., 2017), Colombia (Bravo et al., 2017), Africa (Killough, 2019), Armenia (Asmaryan et al., 2019), China (Yao et al., 2018), and Mexico (Dhu et al., 2019).
Other initiatives produce data cubes by combining images from different sensors, such as the Cubesat Enabled Spatio-Temporal Enhancement Method (CESTEM) and the Framework for Operational Radiometric Correction for Environmental monitoring (FORCE). CESTEM produces radiometric harmonisation of images from Planet's constellation, Landsat, Sentinel, and MODIS (Houborg, McCabe, 2018), while FORCE harmonises ARD data from Landsat 8 Operational Land Imager (OLI) and Sentinel-2 MultiSpectral Instrument (MSI) using software developed by Frantz (2019).
The Brazil National Institute for Space Research (INPE) is developing the Brazil Data Cube (BDC) project to create ARD data cubes of the Brazilian territory. Among other data cubes, BDC provides one made of images from the fourth China-Brazil Earth Resources Satellite (CBERS-4). CBERS-4 sensors provide medium resolution images in the visible and infrared regions of the electromagnetic spectrum.
A medium resolution CBERS-4 data cube could extend and improve the scope of environmental monitoring studies such as deforestation mapping, greenhouse gas emission assessment, and forest fire detection. In addition, it may encourage the scientific community to research and develop new cartographic products, since several works have established data cubes as a technology for mapping land use and land cover (LULC) (Hamunyela et al., 2016, Brooke et al., 2017, Lucas et al., 2019, Xi et al., 2019) and for monitoring both forest (Hamunyela et al., 2017, Hermosilla et al., 2018) and urbanisation (Killough, 2019).
In this study, we demonstrate the applicability of the CBERS-4 data cube for mapping LULC. Additionally, we present to the scientific community a data cube that is already available on the BDC project platform.

DATA CUBE INITIATIVES
The Australian Geoscience Data Cube (AGDC) has been producing ARD data from the Landsat and Sentinel-2/MSI satellites since 2018. Its aim is to explore the EO potential by addressing the volume, velocity, and data diversity challenges of EO big data. The AGDC applications range from understanding environmental changes, such as water availability and urban and agricultural expansion, to allowing companies and industries to access EO data so that they can innovate and develop new products (Dhu et al., 2019).
The purpose of the Swiss Data Cube (SDC) is to monitor environmental changes, spatially and temporally, and to provide ARD data so that Swiss scientific institutions can innovate and generate information that improves knowledge about the Swiss environment, in addition to enabling more effective responses to problems of national importance (Giuliani et al., 2017). The SDC contains optical data from the Landsat and Sentinel-2 satellites and radar data from Sentinel-1 (Dhu et al., 2019).
The Regional Data Cube for Africa (ARDC), launched in 2018, comprises five countries: Ghana, Kenya, Senegal, Sierra Leone, and Tanzania. ARDC's goal is to provide access to EO data, through a free and open data infrastructure, to address the United Nations Sustainable Development Goals (SDG) and other priorities in each country (Killough, 2019). In addition, it aims to build the capacity of end users to apply EO ARD to address local and national needs. The ARDC includes Landsat ARD since 2000, and plans to add Sentinel-1 and Sentinel-2/MSI data (Dhu et al., 2019).
The objective of the Brazil Data Cube (BDC) project, started in 2019, is to develop a platform for the analysis and visualisation of large volumes of EO ARD and to harmonise data from different satellites and sensors for the Brazilian biomes. It also aims to create LULC maps and to support other Brazilian projects that monitor deforestation, burning, and land use and land cover changes. The BDC consists of data from medium spatial resolution sensors (20-30 m) from the Landsat 8/OLI, CBERS-4/WFI, and Sentinel-2/MSI platforms, covering the entire Brazilian territory.
The Colombian Geoscience Data Cube (CDCol) aims to cover the entire life cycle of the image analysis process, and therefore to provide ARD data to Colombian institutions, which will use the information to support the forest and carbon monitoring system. The CDCol initial input includes 15 years of satellite images from Landsat 5, 7, and 8.
Armenia developed and implemented the first version of an Armenian data cube, in partnership with the Swiss Data Cube, in order to obtain data that help address the challenges the country faces related to environmental issues and the lack of data (Asmaryan et al., 2019). The Armenian data cube includes Landsat 5, 7, 8, and Sentinel-2/MSI ARD over Armenia, from 2016 to 2019.
The China Data Cube (CDC) is being developed to meet the needs of researchers in related areas, such as monitoring changes in ecosystems, floods, agriculture, and climate (Yao et al., 2018). The CDC, which uses the Open Data Cube infrastructure, has ingested data from China's GF1 satellite and plans to include more Chinese EO data, such as HJ1A/1B and ZY, as well as data from other satellites such as Landsat.
The Mexico Geospatial Data Cube (MGDC) is being developed, in collaboration with Geoscience Australia, at the National Institute of Statistics and Geography of Mexico (INEGI). The MGDC will contain ARD Landsat images since 1984, and its architecture is prepared to receive Sentinel-2/MSI data (Dhu et al., 2019). MGDC product images have already been used to provide information on issues related to Natural Resources and Agricultural Statistics in Mexico (Dhu et al., 2019). The MGDC system is expected to become a transversal service platform for INEGI.

CBERS-4 DATA CUBE
The CBERS-4 data cube is one of the ARD cubes that the Brazil Data Cube project has been developing. The CBERS-4 satellite was launched in December 2014 with four sensors: the Panchromatic and Multispectral Camera (PAN), the Multispectral Camera Regular (MUX), the Wide Field Imaging Camera (WFI), and the Multispectral and Thermal Imager (IRS). Their characteristics are described in Table 1. The CBERS-4 data cube was created using images from the WFI sensor. The WFI images have a spatial resolution of 64 m, and the raw data are processed to produce co-registered, top of atmosphere and surface reflectance ARD images (Dwyer et al., 2018). The NDVI and EVI indices and a cloud mask product are also computed (Figure 1). From the surface reflectance product, we generate composite images over two different time periods, one month and 16 days. These periods encompass, on average, 6 and 3 observed images, respectively. For each period, after removing cloud masked pixels, we choose one value per pixel using one of two strategies: the simple median, or the "stack" algorithm, which prioritises the non-cloud values of images with less cloud cover. This process generated four temporal compositing products. In this study, we used images from the monthly composite product built with the "stack" algorithm.
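The two compositing strategies can be illustrated with a minimal sketch. It is not the BDC implementation: the function names, the flat per-pixel lists, the use of None for cloud-masked pixels, and the per-image cloud fractions are all assumptions made for illustration. The "stack" variant is read here as "visit images from least to most cloudy and take the first valid value".

```python
from statistics import median

NODATA = None  # marker assumed here for cloud-masked or missing pixels


def median_composite(observations):
    """Per-pixel median of the valid (non-masked) values in one period."""
    n_pixels = len(observations[0])
    composite = []
    for i in range(n_pixels):
        valid = [obs[i] for obs in observations if obs[i] is not NODATA]
        composite.append(median(valid) if valid else NODATA)
    return composite


def stack_composite(observations, cloud_cover):
    """Per-pixel first valid value, visiting images from least to most cloudy."""
    order = sorted(range(len(observations)), key=lambda k: cloud_cover[k])
    n_pixels = len(observations[0])
    composite = []
    for i in range(n_pixels):
        value = NODATA
        for k in order:
            if observations[k][i] is not NODATA:
                value = observations[k][i]
                break
        composite.append(value)
    return composite


# Three images of a four-pixel scene within one compositing period (made-up values).
images = [
    [0.20, None, 0.40, None],
    [0.30, 0.50, None, None],
    [0.10, None, 0.60, None],
]
cloud_fraction = [0.25, 0.50, 0.10]  # hypothetical per-image cloud cover

median_result = median_composite(images)
stack_result = stack_composite(images, cloud_fraction)
```

A pixel that is masked in every image of the period stays empty in both composites, which is why a later gap-filling step is still needed.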
All images are available on the BDC project portal. BDC ARD data are supplied as tiled products defined in an equal area projection. The spatial footprint follows a 1:250,000 grid with a tile size of approximately 165 × 110 km. All data are provided in Cloud Optimised GeoTIFF (COG) format (Cloud Optimized GeoTIFF, 2019) and described according to the Spatiotemporal Asset Catalog (STAC) specification (Spatio Temporal Asset Catalog, 2019). The software used to build, access, and process the CBERS-4 data cubes is open source and the code is available on the BDC project's GitHub at https://github.com/brazil-data-cube.

PROOF OF CONCEPT
In this section, we present an application of LULC classification. After delimiting the study area, the process consisted of preparing the data cubes, obtaining the time series for the samples, training and validating the random forest model, and generating the LULC map (Figure 2).

Study area
The study area is located in the Cerrado biome, on the border of three Brazilian states: Mato Grosso, Mato Grosso do Sul and Goiás (Figure 3). The period of analysis is from September 2018 to August 2019.
The Cerrado is the second largest biome in South America, occupying an area of 2 million km², about 22% of the Brazilian territory. Considered a global biodiversity hotspot (Klink, Machado, 2005), it has an extreme abundance of endemic species but suffers an exceptional loss of habitat. Because of its biological diversity, the Cerrado is recognised as the richest savanna in the world, sheltering at least 11,600 species of catalogued native plants (Brazilian Ministry of Environment, 2019).

Data cube access and preparation
The CBERS-4 data cube can be accessed through the BDC portal or through the STAC service, which provides information about the images as well as the links to access their contents. The service API enables automated access and preparation of the data cube. Here, we computed new indices and filled null data using temporal linear interpolation. We used the data cube of monthly composite images built with the "stack" algorithm.

ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume V-3-2020, 2020 XXIV ISPRS Congress (2020 edition)
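As a rough illustration of how STAC metadata supports automated preparation, the sketch below groups asset links by band and orders them by date. It relies only on the general layout of a STAC Item (a `properties.datetime` field and an `assets` mapping whose entries carry an `href`). The item ids, band names, and URLs are placeholders, not real BDC catalogue entries, and the helper `assets_by_band` is hypothetical.

```python
from collections import defaultdict


def assets_by_band(items):
    """Group asset links by band name, ordered by acquisition datetime."""
    series = defaultdict(list)
    # ISO 8601 timestamps in the same format sort correctly as strings.
    for item in sorted(items, key=lambda it: it["properties"]["datetime"]):
        for band, asset in item["assets"].items():
            series[band].append((item["properties"]["datetime"], asset["href"]))
    return dict(series)


# Hypothetical STAC Items; ids, band names, and URLs are placeholders.
items = [
    {
        "type": "Feature",
        "id": "example-item-2018-10",
        "properties": {"datetime": "2018-10-01T00:00:00Z"},
        "assets": {
            "red": {"href": "https://example.org/2018-10/red.tif"},
            "nir": {"href": "https://example.org/2018-10/nir.tif"},
        },
    },
    {
        "type": "Feature",
        "id": "example-item-2018-09",
        "properties": {"datetime": "2018-09-01T00:00:00Z"},
        "assets": {
            "red": {"href": "https://example.org/2018-09/red.tif"},
            "nir": {"href": "https://example.org/2018-09/nir.tif"},
        },
    },
]

links = assets_by_band(items)
```

Each band then maps to a date-ordered list of image links, which is the structure needed to build per-pixel time series.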

In addition to the two indices provided by this product, the Enhanced Vegetation Index (EVI) and the Normalized Difference Vegetation Index (NDVI), we computed four other indices: the Global Environment Monitoring Index (GEMI), the Green Normalized Difference Vegetation Index (GNDVI), the Normalized Difference Water Index (NDWI2), and the Photosynthetic Vigour Ratio (PVR).
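For orientation, three of these indices can be sketched from surface reflectance values using their commonly published formulations. The exact definitions and coefficients used in this work are not restated in the text, so the formulas below are assumptions. GEMI, NDWI2, and PVR are left out because their variants differ between sources. The reflectance values are made up.

```python
def normalized_difference(a, b):
    """(a - b) / (a + b), or None when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else None


def ndvi(nir, red):
    """Normalized Difference Vegetation Index."""
    return normalized_difference(nir, red)


def gndvi(nir, green):
    """Green Normalized Difference Vegetation Index."""
    return normalized_difference(nir, green)


def evi(nir, red, blue):
    """Enhanced Vegetation Index with the widely used coefficients 2.5, 6, 7.5, 1."""
    denominator = nir + 6.0 * red - 7.5 * blue + 1.0
    return 2.5 * (nir - red) / denominator if denominator != 0 else None


# Made-up surface reflectance values for one vegetated pixel.
blue, green, red, nir = 0.03, 0.06, 0.05, 0.45

pixel_ndvi = ndvi(nir, red)
pixel_gndvi = gndvi(nir, green)
pixel_evi = evi(nir, red, blue)
```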
In total, the cube holds six vegetation indices and four spectral bands: blue, green, red, and near-infrared. The CBERS-4 data cube consists of images organised by date and band. Ordered by date, the images are stacked to form a time series for each attribute of each pixel. To fill any null data in the time series, we interpolated the missing values linearly using the closest valid values available. After this processing, we extracted the time series for each sample for the training stage and used them later to generate the LULC maps.
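The gap-filling step can be sketched as follows for a single pixel time series with evenly spaced dates. The function name and the handling of gaps at the start or end of the series (copying the nearest valid value) are assumptions, since the text only states that the closest valid values are used.

```python
def fill_gaps_linear(series):
    """Fill None entries by linear interpolation between the nearest valid neighbours.

    Gaps at either end of the series take the nearest valid value (an assumption).
    """
    valid = [i for i, value in enumerate(series) if value is not None]
    if not valid:
        return list(series)  # nothing to interpolate from
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = series[left] + weight * (series[right] - series[left])
    return filled


# Made-up monthly NDVI series with missing observations.
ndvi_series = [None, 0.2, None, None, 0.8, None]
filled_series = fill_gaps_linear(ndvi_series)
```

With these values the two interior gaps become roughly 0.4 and 0.6, and the end gaps copy 0.2 and 0.8.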

Sample dataset
We merged two sample datasets, one derived from high resolution images and collected by remote sensing specialists, and the other from the BDC project database. The merged dataset includes 1,042 LULC samples divided into four classes: Natural Vegetation (NV), Pasture (P), Semi-Perennial Crop (SP Crop), and Annual and Perennial Crop (AP Crop).
To evaluate the separability of these classes in the sample dataset, we used a Self-Organizing Map (SOM) neural network. SOM is a suitable clustering method when working with time series data (Aghabozorgi et al., 2015). It is an unsupervised neural network where the input layer is the sample dataset and the output layer is a set of grouped neurons.
SOM allows us to map from high to low dimensional spaces while preserving the topology of the data and reducing the computational cost. We also used SOM to evaluate which spectral bands and vegetation indices are best suited for LULC separability. The SOM parameters were: a grid size of 5 × 6, a learning rate decreasing from 0.05 to 0.01, and 300 iterations.
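A small pure-Python sketch of a SOM with these parameters is given below. It is not the implementation used in this work. The Gaussian neighbourhood, its radius schedule, the reading of one iteration as one pass over the samples, and the majority-label filtering rule are all assumptions, and the two-class toy data are synthetic.

```python
import math
import random
from collections import Counter


def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to x."""
    def sq_dist(w):
        return sum((wk - xk) ** 2 for wk, xk in zip(w, x))
    return min(range(len(weights)), key=lambda n: sq_dist(weights[n]))


def train_som(samples, rows=5, cols=6, iterations=300,
              lr_start=0.05, lr_end=0.01, seed=0):
    """Train a rows x cols SOM; each iteration is one pass over the samples."""
    rng = random.Random(seed)
    weights = [list(rng.choice(samples)) for _ in range(rows * cols)]
    sigma_start = max(rows, cols) / 2.0  # assumed initial neighbourhood radius
    sigma_end = 0.5                      # assumed final neighbourhood radius
    for t in range(iterations):
        frac = t / max(iterations - 1, 1)
        lr = lr_start + frac * (lr_end - lr_start)  # linear decay 0.05 -> 0.01
        sigma = sigma_start + frac * (sigma_end - sigma_start)
        for x in rng.sample(samples, len(samples)):
            bmu_row, bmu_col = divmod(best_matching_unit(weights, x), cols)
            for n, w in enumerate(weights):
                row, col = divmod(n, cols)
                grid_d2 = (row - bmu_row) ** 2 + (col - bmu_col) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                for k in range(len(w)):
                    w[k] += lr * h * (x[k] - w[k])
    return weights


def filter_by_neuron_majority(samples, labels, weights):
    """Indices of samples whose label matches the majority label of their neuron."""
    bmus = [best_matching_unit(weights, x) for x in samples]
    votes = {}
    for bmu, label in zip(bmus, labels):
        votes.setdefault(bmu, Counter())[label] += 1
    majority = {bmu: counts.most_common(1)[0][0] for bmu, counts in votes.items()}
    return [i for i, (bmu, label) in enumerate(zip(bmus, labels))
            if majority[bmu] == label]


# Toy data: two synthetic classes in a 2-D feature space (not the paper's samples).
data_rng = random.Random(1)
samples = [[0.2 + data_rng.uniform(-0.05, 0.05), 0.2 + data_rng.uniform(-0.05, 0.05)]
           for _ in range(10)]
samples += [[0.8 + data_rng.uniform(-0.05, 0.05), 0.8 + data_rng.uniform(-0.05, 0.05)]
            for _ in range(10)]
labels = ["NV"] * 10 + ["P"] * 10

som_weights = train_som(samples)
kept = filter_by_neuron_majority(samples, labels, som_weights)
```

In practice each sample would be a full time series of bands and indices rather than a two-value vector, but the training loop is the same.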

Random forest classification
The random forest algorithm (Breiman, 2001) is an ensemble method based on a decision tree model. Its strategy consists of growing many decision trees via bootstrap and random feature selection to reduce classification bias. A majority voting scheme is used to obtain the final classification.
The classification was done using the sits R package, open source software developed by our research group. For the classification, we used the full depth of the CBERS image time series to create larger dimensional spaces, and we set the number of decision trees to 1000. As each tree grows, only a fraction of the attributes is considered to split a node according to the Gini index, used here as an attribute relevance criterion.
To validate the resulting classification we used 5-fold cross-validation (Wiens et al., 2008). It ran five different assessments, each using 80% of the samples for training and 20% for prediction. The average accuracy of the five classifications was used to produce a single accuracy estimate.
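To make the split criterion concrete, the sketch below computes the Gini impurity of a set of class labels and the size-weighted impurity of a candidate split. It is a generic illustration of the criterion, not the sits implementation, and the function names and labels are made up.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


parent = ["NV", "NV", "P", "P"]
parent_gini = gini_impurity(parent)                          # mixed node
clean_split = split_impurity(["NV", "NV"], ["P", "P"])       # separates the classes
useless_split = split_impurity(["NV", "P"], ["NV", "P"])     # no improvement
```

A tree prefers the split with the largest drop from the parent impurity, so here the clean split would be chosen over the useless one.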
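The fold construction can be sketched as below. The shuffling, the seed, and the `fit_and_score` callback are assumptions for illustration. In a real run the callback would train the classifier on the training indices and return its accuracy on the held-out indices.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def cross_validate(n_samples, fit_and_score, k=5, seed=0):
    """Average the score returned by fit_and_score(train, test) over the k folds."""
    scores = [fit_and_score(train, test)
              for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(100))

# Placeholder scorer: reports the held-out fraction instead of a real accuracy.
mean_score = cross_validate(100, lambda train, test: len(test) / (len(train) + len(test)))
```

With five folds each assessment holds out about 20% of the samples, matching the 80/20 proportion described above.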

RESULTS AND DISCUSSION
For the study area samples, the SOM clustering reduced the size of the sample dataset by 17.8%, from 1,042 to 856 samples. We used this filtered dataset to train the classification model. The user's and producer's accuracies for the mapped LULC classes are presented in Table 3. The classification quality assessment using 5-fold cross-validation (Wiens et al., 2008) of the training samples showed an overall accuracy of 97.0% and a Kappa index of 0.96.
The results demonstrate that the CBERS-4/WFI data cube is well suited not only for mapping LULC but also for detecting land use and land cover change (LULCC), as it provides ARD image time series. Currently, one of the biggest concerns for the Cerrado biome is the rapid change in LULC (Soterroni et al., 2019). Agricultural crops and pasture areas have been expanding over natural vegetation at considerable speed. Maps produced from data cubes in the ARD format can be a powerful tool for monitoring the dynamics of land use and land cover in the Cerrado.
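For readers unfamiliar with these measures, the sketch below derives overall accuracy, Cohen's Kappa, and user's and producer's accuracies from a confusion matrix. The two-class matrix is invented for illustration and is not the confusion matrix behind the figures reported above. Rows are assumed to hold mapped classes and columns reference classes.

```python
def accuracy_measures(matrix):
    """Overall accuracy, Kappa, user's and producer's accuracy from a confusion matrix.

    Rows are assumed to be mapped (predicted) classes, columns reference classes.
    """
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    observed = sum(matrix[i][i] for i in range(n_classes)) / total
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1.0 - expected)
    users = [matrix[i][i] / row_totals[i] for i in range(n_classes)]
    producers = [matrix[i][i] / col_totals[i] for i in range(n_classes)]
    return observed, kappa, users, producers


# Invented two-class confusion matrix (not the paper's results).
confusion = [
    [40, 10],
    [0, 50],
]
overall, kappa, users, producers = accuracy_measures(confusion)
```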
Besides, the CBERS-4/WFI data cube can be considered better than MODIS products, which are widely used for mapping the Cerrado biome, because of its spatial resolution. For example, the spatial resolution of MODIS (250 × 250 m) causes spectral mixing and limits pattern recognition in heterogeneous landscapes (Zhong et al., 2016) such as the study region. For natural formations, MODIS is unable to capture complex vegetation gradients such as some sub-types of savanna (e.g., shrublands and grasslands) (Schwieder et al., 2016), which makes cerrado areas spectrally similar to forest areas depending on their density (Simoes et al., 2020). For other vegetation classes, such as pasturelands and agricultural vegetation, the spatial resolution of MODIS may cause spectral confusion due to their seasonal variation (Chaves et al., 2018, Picoli et al., 2018). As shown in this work, both situations can be improved by using the CBERS-4/WFI data cube.
Figure 4 shows the resulting map with the spatial distribution of LULC classes from September 2018 to August 2019, produced by applying the random forest algorithm to the vegetation indices derived from the CBERS-4/WFI data cube (EVI, NDVI, GEMI, GNDVI, NDWI2, and PVR) and to the red, green, blue, and near-infrared spectral bands.
The CBERS-4/WFI data cube can be used to monitor the expansion of agriculture, pasture, and urban areas, to monitor natural disasters, and to map deforestation. Projects like the Monitoring of the Brazilian Amazon Deforestation by Satellite (PRODES) (Brazil's National Institute for Space Research - INPE, 2020), which already use CBERS-4 satellite data to map deforestation across the Legal Amazon, can benefit from incorporating the CBERS-4/WFI data cube into their classification systems. Using the CBERS-4 data cube, the technicians of this project will be able to skip the image processing steps and focus on detecting deforestation.
Other projects for monitoring LULC in Brazil, such as Terra-Class (Almeida et al., 2016) and MapBiomas (Azevedo et al., 2018), will also benefit from data cubes, as they will have time series data in the ARD format to map Brazilian biomes. These projects could allocate more human resources to the development of mapping. Besides the benefits already mentioned, the CBERS-4/WFI data cube can support public policies aimed at mitigating the impacts of global environmental changes. In the Brazilian context, the use of this product can also symbolise more autonomy, given the national efforts and technology invested in developing and launching the satellite in partnership with China.

NEXT STEPS
Recently, in December 2019, a new satellite of the CBERS program (CBERS-04A) was successfully launched carrying sensors with specifications compatible with those on board CBERS-4. This launch ensures continuity of image acquisition as well as an increase in revisit frequency. The BDC project intends to combine images from both CBERS-4 and CBERS-04A in a single data cube.
In the context of the BDC project, other medium resolution data cubes are being generated and tested besides CBERS-4/WFI, such as those built from Sentinel-2/MSI and Landsat 8/OLI imagery. Initially, these data cubes will cover the period from 2017 (the launch of Sentinel-2B) to 2021 for the entire Brazilian territory.
Moreover, the BDC research team is studying multi-source image harmonisation strategies to produce spectrally harmonised data cubes. The BDC project has been testing atmospheric correction algorithms and developing procedures to produce data cubes containing harmonised surface reflectance images from the Landsat 8/OLI and Sentinel-2/MSI satellites. This will increase data frequency and provide adequate observations to characterise highly dynamic LULC processes.

CONCLUSIONS
Because of the large volume of EO data, such as that from the CBERS-4 WFI sensor, which has acquired data every 5 days since 2014, users need a lot of time to process and organise images. With the Brazil Data Cube project, users now have access to EO ARD to meet their needs.
The proof of concept presented in this paper shows that the LULC classification of the study area using the EVI, NDVI, GEMI, GNDVI, NDWI2, and PVR vegetation indices and the red, green, blue and near-infrared spectral bands is promising for LULC mapping and monitoring. The map generated from the CBERS-4 data cube provides important information that can be used to understand LULC dynamics and to monitor human activities in Brazilian biomes.
Furthermore, the data cube provides a database to support the scientific community in a wide range of applications, such as monitoring deforestation, calculating carbon emissions, and hydrological monitoring and modelling, among others.

ACKNOWLEDGEMENTS
This work was funded by the Amazon Fund through the financial collaboration of the Brazilian Development Bank (BNDES) and the Foundation for Science, Technology and Space Applications (FUNCATE) no. 17.2.0536.1 (Brazil Data Cube project). This work is also supported by the Group on Earth Observations (GEO) through the Earth Observation Cloud Credits Programme.