SENTINEL-2 SURFACE REFLECTANCE PRODUCTS GENERATED BY CNES AND DLR: METHODS, VALIDATION AND APPLICATIONS

To allow for a robust and automatic exploitation of Sentinel-2 data, Analysis Ready Data (ARD) products are requested by most users. The processors of ARD products take care of the common burdens necessary for most applications, that include precise orthorectification, cloud detection and atmospheric correction steps, as well as the generation of periodic syntheses of cloud free surface reflectances. The French Theia land data center, and the German Earth Observation Center (EOC) started delivering Sentinel-2 surface reflectance products to users in 2016 in France and 2019 in Germany respectively. Both centers produce and distribute these data sets in near real time, over large regions requested by French users such as Western Europe, Maghreb, Sahel, Madagascar ... Theia’s and EOC products include an instantaneous surface reflectance product (Level-2A), and a monthly cloud free synthesis of surface reflectance (Level-3A). This article shortly describes the methods used to generate the Level-2A products with the MAJA processor, and the Level-3A products with the WASP processor. The MAJA processor is based on multi-temporal methods, that use the slow variation of surface reflectance to detect clouds and estimate aerosol depth, while WASP, thanks to the quality of MAJA cloud mask, calculates a weighted average of all the cloud free observations over 45 days, every month. The article also provides validation results for Level-2A and Level-3A products, resulting from comparison with in-situ data and with other methods. A last section gives first insights from the monitoring of user uptake of the distributed products


INTRODUCTION
In the early decades of remote sensing data science, users would spend a large amount of their time devoted to remote sensing at ortho-rectifying, calibrating, correcting atmospheric effects and obtaining cloud free surface reflectance images. Nowadays, thanks to its open access policy, its systematic and frequent revisit, its global coverage and its data quality, the Copernicus Sentinel-2 mission (Drusch et al., 2012) revolutionizes the optical earth observation at a high resolution. More than 35000 users (Knowelden, 2019) have downloaded Sentinel-2 data in 2018. To spare users of lot of efforts, the role of producing and distributing analysis ready data (ARD), that allow a fast, robust and automatic exploitation of Sentinel-2 data, is now devoted to space agencies or data access centres (CEOS, 2020). The ARD products take care of the common burdens necessary for most applications, that include ortho-rectification, calibration, cloud detection and atmospheric correction steps.
For Sentinel-2, the Copernicus ground segment provides Level 1C products, which are ortho-rectified and expressed in top of atmosphere reflectance. The Level-2A product provides surface reflectance with a cloud mask (Drusch et al., 2012). The French Land Data center, Theia, at CNES, and the Earth Observation Center (EOC), at DLR, produce and distribute Level-2A products (L2A) thanks to the MACCS-ATCOR Joint Algorithm * Corresponding author (MAJA), which is the result of a common effort of the French and German space agencies, the Centre National d'Etudes Spatiales (CNES) and the Deutsches Zentrum für Luft-und Raumfahrt (DLR), and of the Centre d'Etudes Spatiales de la BIOsphere (CESBIO). MAJA is based on the Multi-temporal Atmospheric Correction and Cloud Screening software (MACCS) (Hagolle et al., 2015) and includes a few modules from the AT-COR software (Richter et al., 2006). Compared to classical processors based on multi-spectral relations to detect clouds and estimate aerosol content, MAJA also involves multi-temporal criteria, which assume that land surface reflectance tends to change slowly as compared to the atmospheric effects due to clouds and aerosols. To our knowledge, MAJA is the only Level-2A processor using multi-temporal criteria, it is therefore using more information to detect clouds and aerosols than the other methods.
But even with accurate surface reflectance and cloud masks, Sentinel-2 Level-2A data usage is still complicated and requires specific technical skills which are not possessed by all users. The presence of clouds causes spatial and temporal data gaps in the time series that have to be handled by automatic processing, and the swath of Sentinel-2 satellites is limited. It is therefore impossible to observe a large region in one overpass, and users have to rely on data acquired on different dates with different cloud covers. Merging different acquisitions may cause artifacts at the boundaries of swaths or around the data gaps due to clouds. Overcoming this difficulty is the role of the Level-3A product (L3A), which provides a monthly synthesis of cloud free surface reflectance. Since a few years, Theia and EOC have been distributing Level-3A products to users based on the Weighted Average Synthesis Processor (WASP), which calculates a weighted average of cloud free surface reflectances after a directional correction to normalize data acquired from different swaths.
MAJA and WASP processors are being used by CNES and DLR, within the Theia (Theia, 2020) and EOC data centers (EOC, 2020) respectively, to produce in near real time Sentinel-2 Level-2A and Level-3A products. These products have been delivered to more than 2500 users so far, and cover about a surface of 10 Million km 2 for THEIA and 0.4 Million km 2 for EOC (coverage of Germany). On average, each product generated has been downloaded 1.3 times, counting only real single users and not redistribution platforms. MAJA can also be applied to other satellites such as LANDSAT (Loveland, Dwyer, 2012), with a 16 day revisit, or VENµS with a two day revisit (Ferrier et al., 2010).
Validation is a key element to strengthen trust on an ARD product. CNES made significant efforts to obtain reference data and in-situ measurements, and participated to the atmospheric correction intercomparison international experiments (ACIX) (Doxani et al., 2018) that aim at cross comparing a large number of Level-2A processors. Some of these results are presented hereafter. Our presentation details the methods behind MAJA and WASP processors and provides validation results concerning the quality of cloud detection and atmospheric correction, as well as the quality of Level-3A syntheses. It finally gives some feedback on how Level-2A and Level-3A products are used.

Level-2A processing
The MAJA Level-2A processor is described in depth in MAJA's ATBD (Hagolle et al., 2017), and we only provide here a compact description of its core elements. Several operations are necessary to obtain surface reflectances and a good quality cloud mask for a Sentinel-2 scene: • estimation of water vapor content, using the water vapor band at 940 nm • correction for molecular absorption (including water vapour) • detection of clouds, cloud shadows, water, snow, using multi-temporal and multi-spectral criteria • estimation of aerosol content, using multi-temporal and multi-spectral criteria • correction of cirrus clouds (optional), thanks to cirrus band at 1370 nm • correction of scattering due to molecules and aerosols • correction of adjacency effects (the atmosphere tends to blur the images) • correction of terrain effects (due to the effect of slopes on the illumination) The detection of clouds, and the estimation of the water vapour and aerosol contents are performed at a coarse resolution (240 m), in order to speed-up the process. In order to account for differences in viewing direction between Sentinel-2 spectral bands, and also to include the cloud fuzzy edges, the cloud and shadow masks are dilated using a buffer of 240m. All the masks are then oversampled at full resolution before running the atmospheric effect corrections.
The main feature of MAJA is the combined use of multitemporal and multi-spectral methods to better detect clouds and cloud shadows and estimate atmospheric aerosol content (Hagolle et al., 2015). The use of multi-temporal information adds information to the retrieval, and provides an increased data quality and robustness. For instance, with a multispectral method only, it is generally difficult to tell a low altitude cloud from a bright cloud-free region, while the multitemporal method will detect the change in surface reflectance. Similarly, many multi-spectral aerosol estimation methods are limited to sites with dark targets and dense vegetation, while multi-temporal methods can be applied to arid regions. Besides, MAJA can optionnally benefit from auxilliary data on the relative proportion of seven aerosols species, interpolated to the time of acquisition of an L1C product, obtained from the Copernicus Atmosphere Monitoring Service (CAMS) (Inness et al., 2019). Using the CAMS data leads to refined atmospheric optical depth estimates as compared to the default use of continental aerosol properties.

Level-3A processing
For many of the optical earth observing satellites that provide systematic and frequent acquisitions, it has been a common practice to also deliver Level-3A products, which are cloud free images, valid for a longer period of time: one fortnight, one month, one year... Such processors were initially developed for moderate resolution sensors (Tarpley et al., 1984), as high resolution sensors lacked a high revisit frequency. It is not the case anymore with Sentinel-2 or VENµS satellites. Most of the Level-3A methods use, for each pixel, a so-called best available pixel method, involving several selection criteria which increase the likelyhood to select cloud free and shadow free pixels. The most classical one is selecting the date which has the maximal Normalzed Difference Vegetation Index (NDVI). Given that these methods work at pixel level, contiguous pixels are frequently acquired on different dates, with possibly different viewing angles and different atmospheric correction errors (Hagolle et al., 2005). Such artifacts can be seen on the left part of Figure 1. If the quality of the atmospheric correction and cloud detection in the Level-2A is high enough, it is not necessary anymore to select cloud free observations with a low likelihood of being affected by clouds and aerosols, as they are already simply discarded using the cloud mask. As a result, WASP can rely on a weighted average of the surface reflectances of all the cloud free pixels observed during the synthesis period. The weights are higher when the images cloudiness or AOT are low. Details on the method are provided in WASP algorithm specification . In order to merge data acquired from adjacent orbits, a preliminary correction of directional effects is needed: at the boundary of two adjacent swaths, the viewing angles change abruptly and may cause differences of a few percents in reflectance, which can be highly visible. For our correction, we used a correction method developped by Roy et al (Roy et al., 2017). This correction only depends on the sun and observation geometry, and not on Figure 1. comparison of a monthly synthesis obtained using a best available pixel method (left) and a using a weighted average method (right). The higher level of noise on the best available pixel image is due to the selection of different dates on adjacent pixels the type of target. Although it is a rough approximation, in most cases, it seems enough to avoid visible artifacts, as it may be observed for instance on Figure 7. However, the accuracy of this correction is reduced if the sun zenith angle is greater than 50°a nd we are actively working on a replacement of this method.

Cloud masks
As there is no current network of ground based observation of cloud presence at a high resolution, the validation of cloud masks is a complicated task. It first requires a good definition of what a cloud or a cloud shadow is, and there are divergences according to authors. Cloud mask validation usually relies on a lot of manpower to manually classify images, polygons or points on selected reference images. Baetens et al developed an active learning cloud detection method to interactively build reference cloud masks on whole images. A first selection of cloud or cloud free, shadow or shadow free reference pixels is made, that is used in a random forest surpervised classification to obtain a reference cloud and shadows mask, together with confidence estimates. Additional reference points are added to improve the confidence and remove errors before iterating the process again. After 5 iterations, the resulting reference cloud masks are obtained. Such a process can be done in one hour by a trained operator. The overall accuracy of reference cloud masks was estimated at 98%, which is enough to assess the accuracy of operational cloud masks.
Thirty cloud masks were generated on seven very different sites, for different dates. These masks were used to compare the performances obtained from three different operational processors: MAJA, FMask V4 (Zhu et al., 2015), which is used by the United States Geological Survey to produce Sentinel-2 data, and Sen2cor (Muller-Wilm, 2017), which is the official Level-2A chain used to produce ESA's Level-2A products. Both Fmask and Sen2cor do not use multi-temporal detection to detect clouds, but rely on multi-spectral criteria. However, FMask also uses a possibility offered by Sentinel-2 to use the small parallax due to the fact that the spectral bands observe the earth with slightly different viewing directions. MAJA uses this feature only for VENµS sensor. Due to this parallax, it is necessary to dilate the cloud masks, but Sen2cor does not implement a dilation because it would also dilate the frequent false detection of clouds over buildings. In order to compare similar cloud masks, the statistics provided below include a buffer of 240 m for the reference and evaluated cloud masks. The overall accuracy per image is provided on Figure 2. The average overall accuracy for the whole data set ranges from 91% for MAJA, 90% for FMask and 84% for Sen2cor. Even if these percentages are quite high, it has to be noted that an overall accuracy of 84% corresponds to 16% of errors, which is a large amount. For more detailed analysis, please refer to (Baetens et al., 2019).

Aerosols
The estimation of aerosol optical depth (AOD) is one of the main drivers of the quality of the atmospheric correction. It is therefore important to validate the estimation performances. Thankfully, the Aeronet network (Holben et al., 1998) provides daily measurements of aerosol optical properties on hundreds of sites. During the atmospheric correction inter-comparison experiment (ACIX-2), Aeronet measurements and Level-2A products were collected over a hundred of sites, and compared. About 10 Level-2A algorithms participated to the experiment, but for the sake of concision, we show here the results from our own method MAJA and Sen2cor which is the official Copernicus processor (Muller-Wilm, 2017). Figure 3 show the difficulty of estimating AOD, with a quite low signal to noise ratio and a high scatter of results, especially for large aerosol optical depths. The results of the other methods showed performances with the same order of magnitude of errors : a reduced mean square error (rmse) of 0.13 to 0.18 when all AOD contents are taken into account, and around 0.1 when AOD above 0.6 are discarded. As shown in next section, such an accuracy is sufficient to obtain a correct surface reflectance estimate as long as the AOD is below 0.6. Sen2cor (Muller-Wilm, 2017) can estimate AOD only if very dark and dense vegetation is present in the image. In the opposite case, weather analyses of aerosol content are used as default values when such pixels are not found within the image. Thanks to the multi-temporal criteria, MAJA retrieves AOD even above arid landscapes, although with a reduced accuracy compared to AOD retrieved over vegetation.

Surface reflectances
The actual output of Level-2A processors are surface reflectances, which is the quantity that should be validated. Nevertheless, the available sites with daily measurements of surface reflectances which can be expressed in the same observation geometry as Sentinel-2 satellites are very scarce. The ACIX-2 experiment only relied on two sites whose data are collected by CNES, La Crau, France, and Gobabeb, Mauritania. These two sites have been set-up to validate the absolute calibration of optical sensors, and they were chosen to minimize all error sources. As a result these sites are very uniform and are situated in regions with a generally low AOD. They also have quite high reflectances, especially Gobabeb, which makes them less sensi-tive to the atmosphere. In 2021, CESBIO will implement a new site in Lamasquere, in the South of France, to validate surface reflectances in more difficult conditions, with lower surface reflectances and less uniform landscapes.
Because of article length constraints, we only show on Figure 4 and 5 the results for La Crau, which is a site covered by some vegetation and not a desert site as Gobabeb. Results are shown for MAJA and Sen2cor, for the red (B4) and NIR (B8) bands. It has to be noted that all the biases observed are in the range of validity of the calibration biases of the satellite (3%) and station (4%). Both results are compatible with a good accuracy, and finally, it is the residual noise that presents best the ability to correct the day to day variations of the aerosol optical properties. This noise is slightly smaller with MAJA than with Sen2cor for band 8 ( Figure 5). As the final report of ACIX has not yet been published, the presented results are still provisional.

Validation of monthly syntheses
We have defined two criteria to estimate the accuracy of monthly syntheses produced by WASP: -Fidelity criterion: this criterion evaluates the similarity between i) the Level-3A monthly product centered on the fifteenth day of the month and ii) a Level-2A acquired within 3 days of that date, after having discarded the invalid pixels due to clouds, shadows, snow or water. The values presented below correspond to the maximum error for 70% of pixels.
-Artifact criterion: this criterion measures the spatial noise due to the use of different sets of dates in the composite. Because of the presence of clouds or cloud shadows, pixels are discarded, resulting in the use of different sets of dates on each side of the border of the discarded region.The artifact criterion computes the deviation of the synthesis surface reflectance along each side of the border. The higher the difference, the more visible the artifacts. The values presented below correspond to the root mean squared difference for the borders of all different sets of dates in one image.
The results shown on Figure 6 vary according to the sites. A synthesis product processed in a region with a low cloud cover will average much more data and show less artefacts and a better fidelity than a Level-3A product generated in a region which is often cloudy. We concluded from this study that the increase of fidelity errors as a function of synthesis duration was moderate, while the decrease of artifacts was high. WASP products are therefore produced every month, using a synthesis period of 46 days. Of course, this period could be tuned locally, and could be increased for winter or rain season periods. Applying the same method to VENµS data which has a two days revisit enables to deliver Level-3A products every fortnight, with a synthesis period of 22 days.
The images on Figure 7 show two examples of Level-3A products obtained with WASP for France in July 2018 and July 2019. The comparison of both images gives an idea of the consequence of the heat waves of the summer 2019, especially taking into account the fact that 2018 was also a year with a heavy drought. It can also be noticed that the images are almost seamless and have a very low level of artefacts which is a good indicator of the quality of Level-2A and Level-3A products.

USER UPTAKE OF LEVEL-2A AND LEVEL-3A PRODUCTS
This section aims at providing some feedback from the user uptake of our Level-2A and 3A products. The distribution of Level-3A products over France started in spring 2018, and the product needs to be sufficiently advertised so that users become aware of their existence and quality. As a result, the comparison of applications of Level-2A and Level-3A products is still preliminary. Level-2A being instantaneous products, they are favoured by users interested in fast changing surfaces, such as agricultural surfaces, snow, water. Of course, users interested in phenology also chose Level-2A. The first users of Level-3A products were interested in displaying recent fully cloud free images over large regions, and this product finds a lot of interested users to get a quick but detailed view of the current phenological situation, compared to that of previous years. It was for instance used by the main television channels in France to display the effects of the heat waves and drought of summer 2019 in France (Theia, 2019). Recent users have started exploring the use of syntheses to monitor the forest degradations due to insects or drought, or to follow annually the extents of urban footprints or to study vegetation in urban environments. Syntheses are also needed as input for land cover classifications. However, one of the main users of Theia products for land cover applications (Inglada et al., 2017), developed its own syntheses processor before Theia Level-3A prod- ucts were available. This simple processor used linear interpolation within closest cloud free dates to generate cloud free products every month, and we plan to study the impact on land cover classification performances of using WASP syntheses as a replacement to the simple interpolating processor.

CONCLUSIONS
To enable an easier uptake of remote sensing data by users, space agencies and data centers have developped Analysis Ready Data that spare users the burden of pre-processing the data. These tasks are one of the missions of the French and German land data centers, Theia and EOC. Both centers provide users with high quality ARD products generated with MAJA and WASP processors. A large scale validation effort is made to collect in situ data and assess the quality of these products. These validation results show that the detection of valid pixels (without clouds and cloud shadows) is accurate to 91%, AOD is measured with an uncertainty of 0.14 when taking into account heavy AOD cases. It is better than 0.1 if AOD cases above 0.6 are discarded. Surface reflectance errors are in the order of magnitude of 0.01 for Level-2A products and 0.02 for Level-3A products. However, continued research is necessary to further improve these results, especially for images acquired with high aerosol content.
As a complement to the assessment of product quality, monitoring the usage of these products and gathtering feedback from users will allow the data centers to better advise users on the potential and limitations of each product. As the Level-3A product is a relatively new product at decametric resolution, gathering this feedback is particularly important. We have drawn the following conclusions: first, Level-2A is fit for advanced users, while Level-3A is much easier to use. Second, Level-3A is adapted to surfaces which change slowly and applications that do not require a good accuracy on the date of acquisition. Level-3A should be recommended for applications related to forests, cities, general land cover, and natural vegetation cartography, and for multi-annual comparisons. Level-2A products should be used when the needed accuracy on observation date is high, for instance to use short differences in phenologies to separate tree species, or to monitor agriculture status and practices.