ESTIMATION OF OCEANIC PARTICULATE ORGANIC CARBON WITH MACHINE LEARNING

Understanding and quantifying ocean carbon sinks of the planet is of paramount relevance in the current scenario of global change. Particulate organic carbon (POC) is a key biogeochemical parameter that helps us characterize export processes of the ocean. Ocean color observations enable the estimation of bio-optical proxies of POC (i.e. particulate backscattering coefficient, bbp) in the surface layer of the ocean quasi-synoptically. In parallel, the Argo program distributes vertical profiles of the physical properties with a global coverage and a high spatio-temporal resolution. Merging satellite ocean color and Argo data using a neural networkbased method has already shown strong potential to infer the vertical distribution of bio-optical properties at global scale with high space-time resolution. This method is trained and validated using a database of concurrent vertical profiles of temperature, salinity, and bio-optical properties, i.e. bbp, collected by Biogeochemical-Argo (BGC-Argo) floats, matched up with satellite ocean color products. The present study aims at improving this method by 1) using a larger dataset from BGC-Argo network since 2016 for training, 2) using additional inputs such as altimetry data, which provide significant information on mesoscale processes impacting the vertical distribution of bbp, 3) improving the vertical resolution of estimation, and 4) examining the potential of alternative machine learning-based techniques. As a first attempt with the new data, we used some feature-specific preprocessing routines followed by a Multi-Output Random Forest algorithm on two regions with different ocean dynamics: North Atlantic and Subtropical Gyres. The statistics and the bbp profiles obtained from the validation floats show promising results and suggest this direction is worth investigating even further at global scale.


INTRODUCTION
The ocean plays a crucial role in the climate of our planet by regulating the amount of atmospheric carbon dioxide. The magnitude of carbon sequestration in the ocean is driven by two different mechanisms: the so-called physical and biological carbon pumps. The latter is governed by the global export of particulate organic carbon (POC) from surface waters to the deep ocean. However, despite their importance, the processes involved in the biological carbon pump are still poorly constrained. This essentially results from the paucity of global observations at the appropriate spatial and temporal resolution, and in particular in situ POC measurements. Therefore, and in order to start developing an in depth understanding and quantification of export processes at the context of global change, the first prerequisite is to acquire and/or develop data sets with improved spatio-temporal coverage.
The particulate backscattering coefficient (b bp ) is widely used as a bio-optical proxy for POC (e.g. Cetinic et al., 2012). b bp has the advantage that it can be continuously measured in situ from robotic platforms, like Biogeochemical-Argo (BGC-Argo) profiling floats (Claustre et al., 2020;Roemmich et al., 2019) or retrieved from satellite remote sensing. Thus, b bp is a key biooptical property for studying the space-time dynamics of the vertical distribution of POC, possibly opening a path for improving the characterization and quantitative assessment of the biological carbon pump in the global open ocean (Boyd et al., 2019;Briggs et al., In Press). Satellite-derived products of POC from b bp -based algorithms (Stramski et al., 2008) have also * Corresponding author shown their potential to study the spatio-temporal distribution of POC in the open ocean (Gardner et al., 2006;Loisel et al., 2002;Stramska, 2009). However, such satellite-based estimates, restricted to the ocean surface layer, are insufficient in the context of global carbon cycle studies including carbon production and export.
A recent study showed that a neural network-based method could efficiently extend surface bio-optical properties (i.e. b bp ) to depth by merging ocean color and hydrological data (SOCA method for Satellite Ocean-Color merged with Argo data to infer the vertical distribution of particulate backscattering coefficient; Sauzède et al., 2016). The interest in merging such type of data resides in the fact that b bp and hence POC reflects the stock of biological particles. This stock, derived from oceanic photosynthesis, is primarily driven by nutrient availability and light regime in the upper ocean which are both influenced by the physical forcing. Thanks to the Argo program operating and array of nearly 4000 robots measuring hydrological properties with much enhanced spatio-temporal resolution in the global ocean (Roemmich et al., 2009) the resuting acquired data can be combined with ocean color to retrieve the vertical distribution of b bp with high resolution.
Data-driven techniques have become more popular within the scientific community (Bergen et al., 2019) including the ocean sciences (Malde et al., 2019). We are dealing with an explosion of data from different sources of varying quality. Physical models are powerful but trial-and-error approaches to modifying these methods to accommodate new data streams is not possible. As an alternative, data-driven techniques within the machine learning (ML) community are numerous with many approaches that can handle the large quantity, quality and complexity Camps-Valls et al., 2019). In the context of the study mentioned above, a neural networkbased method was trained using the BGC-Argo floats database (∼4700 concurrent in situ temperature, salinity and b bp profiles). This method retrieves the b bp in the water column with an error of ∼20% at a global scale.
Merging data from different sources presents many challenges and so the original authors with the SOCA method (SOCA2016 hereafter, Sauzède et al., 2016) used artificial neural networks to find a function that predicts b bp vertically at a global scale. The original methods used were able to predict b bp for 10 different layers (from the surface to the depth where there is no more phytoplankton biomass). Although the database used in 2016 was representative of most open ocean oceanographic conditions in the global ocean, some areas were significantly undersampled (e.g. southern ocean). It is therefore expected that using the new BGC-Argo database available today (with ∼ 5 times more data and a much better spatial coverage), the method could be greatly improved. In addition, it is timely to consider and evaluate a more powerful method that would allow to estimate b bp at higher resolution along the vertical dimension which is of great interest for carbon export applications.
The success of SOCA2016 motivated the effort to create depthresolved global proxy of POC with higher space-time resolution, a prerequisite for improving the characterization and quantification of export carbon fluxes. In particular, investigators of biogeochemical models have shown great interest and their need for such products, essential for the initialisation and validation of biogeochemical models. Thus, this study takes place in the context of the European Copernicus Marine Environment Monitoring Service (CMEMS), one challenges of which is to improve SOCA2016 to have high level 3D gridded global products of POC (with associated estimation errors), to support biogeochemical model data requirements for their improvements.
The current study is aimed to improve upon the SOCA2016 method (upgraded method hereafter referred as SOCA2020) by 1) using the large amount of new acquired data from BGC-Argo floats network since 2016, 2) using additional inputs such as the sea level anomaly which could give significant information about sub-mesoscale processes the vertical distribution of phytoplankton biomass and hence of POC, 3) replacing some inputs such as the ocean color chlorophyll a concentration and b bp by satellite reflectances to avoid additional errors due to ocean color algorithms, 4) improving the vertical resolution of the outputs (b bp retrieval) and 5) investigating the potential of alternative machine learning-based techniques that could be more efficient and additionally could estimate the retrieval error associated to the outputs, an essential point in the context of modelling.

DATA AND METHODS
The BGC-Argo database used in this study is composed of concurrent vertical profiles of temperature, salinity and particulate backscaterring coefficient (b bp ) merged with satellite products. First, we present more in details BGC-Argo measurements. Then, the procedure for the matchup between BGC-Argo and satellite observations is given in detail. The third section presents the machine learning models envisaged to carry out this study.

BGC-Argo measurements
Profiling floats typically collect measurements from 1000 m to the surface with a 1 m vertical resolution every 10 days. When the float surfaces, data is transmitted in real-time using Iridium communication. Physical Argo profiling floats are equipped with the standard conductivity-temperature-depth sensors that allow one to continuously measure the temperature and salinity in the global open ocean since the early 2000s (Roemmich et al., 2009). The integration of new biogeochemical sensors on Argo floats has led to a new generation of floats, the BGC-Argo floats. These floats measure proxies of major biogeochemical variables such as b bp that is used to train and validate the SOCA methods.
The BGC-Argo profiling floats used in this study are equipped with backscattering sensors that measure the angular scattering coefficient at 124 • relative to the direction of light propagation at wavelength of 700 nm. This measurement is transformed into b bp (700) (hereafter b bp ) following Schmechtig et al. (2016). The same quality control procedure as in Sauzède et al. (2016) was applied to each profile. Because of their log-distribution, b bp values were log transformed.

BGC-Argo and satellite matchup database
For the development of the SOCA2020 method, the new inputs are: 1) ocean color data: the reflectances (ρ) at 5 wavelengths (412, 443, 490, 555 and 670 nm) and the Photosynthetically Available Radiation (PAR) and 2) altimetric data: the Sea Level Anomaly (SLA). The ρ are used in this study to replace Chl and b bp satellite estimations used in SOCA2016, in order to avoid additional input variability due to ocean color algorithms errors. For the long-term vision, PAR and ρ data come from GlobColour satellite multi-mission data (Garnesson et al., 2019) that were downloaded from the Copernicus Marine Environment Monitoring Service (CMEMS, http://marine.copernicus.eu/). The matchup was done using the value of the closest pixel available with a 5-day window (before and after the observation) and within a 5x5 pixel grid. This matchup procedure led to discarding ∼ 50% of the BGC-Argo profiles. The altimetric information (the SLA) is additionally used in SOCA2020 algorithm because it is highly linked to mesoscale structures that are known greatly influence the nitracline depth and so the vertical distribution of phytoplankton biomass and primary productivity (Lévy et al., 2018). The altimetric data are issued from the Global Ocean Multimission altimeter satellite gridded sea surface heights (available from CMEMS, daily data with a 0.25 • spatial resolution). The SLA is computed with respect to a 20-year mean of sea surface height.
The resulting BGC-Argo and satellite matchup database appears to be representative of a broad variety of hydrological and biogeochemical conditions prevailing in the global open ocean making the method applicable everywhere. Here, we focus our study on the North Atlantic Ocean (NA) and the oligotrophic Subtropical Gyres (STG) (blue and red points in Figure 1, respectively). These two areas show quite different physical characteristics and dynamics: the NA ocean presents less salinity and lower temperatures than the STG throughout the year, and it presents strong mixing of water during winter (mixed layer depth, MLD, acquired by the BGC-Argo floats vary between 15 and 900 m). STG areas have a marked water stratification specially during summer, with high sea surface temperature and deeper nitracline depth. These datasets are also representative of most trophic conditions observed in the open ocean (i.e., from oligotrophic to eutrophic waters) and variations in phytoplankton species composition and sizes.

Preprocessing
We implemented some basic preprocessing techniques to make the algorithms easier to train. Ultimately, there were over 1,100 features ( Table 1) with majority of the features coming from the variables temperature, salinity, density and spiciness (that reflects isopycnal water-mass contrasts). In addition, there were only 2,860 samples for the North Atlantic dataset and 1,353 samples for the Subtropical Gyres dataset. This is a bad samplesto-features ratio so we reduced the amount of correlation between the large number of features and alleviate this burden from the machine learning algorithms by implementing a series of simple transformations to better capture the most important aspects of our data.
The distributions for the core variables (SLA, PAR, MLD and the ρ at 5 wavelengths) were skewed and heavy tailed so we did a simple standardization by removing the mean µx from each feature and dividing by the standard deviation σx. There was still very little correlation across variables except between some of ρ variables like ρ412 and ρ443. Some variables are cyclic in nature so we converted the day of the year (DOY) variable to the corresponding sin and cos representation to better capture the time component. The geographic coordinate system (lat, lon), while relevant in Earth sciences, can be very difficult for machine learning algorithms due to the Earth curvature. For example, utilizing distance calculations like the euclidean distance between samples is non-trivial task in a geographical coordinate system compared to a Cartesian coordinate system. The trade-off is that the eventual predictions might produce errors due to the between-coordinate transformation errors. So we converted the latitude and longitude (lat,lon) features into Cartesian (x,y,z) features to better accommodate euclidean-centric distance calculations.
The high dimensional variables (temperature, salinity, density and spiciness) were the ones with the largest amount of features. Each variable had one measurement per layer (all 276). So one option would have been to use the 1-to-1 layer correspondence between the input and the output. This would mean that each output for the y would have training data specifically from its corresponding layer which might proven to be effective. However, we wanted to see if we could learn the relationships between the variables and not just the individual levels.
Hence, we decided to use a Principal Components Analysis (PCA) decomposition on these variables. 5 PCA components  were extracted from each variable thus reducing the number of features for the high dimensional variables from 1,104 (4×276) features to just 20 (4 × 5) features; significantly less. As shown in Figure 2, the cumulative explained variance for 5 PCA components was ≥ 98% and ≥ 97% for the North Atlantic and for the Subtropical Gyres dataset, respectively. An argument can be made that extra 2-3% explained variance hidden in some of the remaining PCA components could have a big impact for detecting rarer events and/or extreme b bp profiles (i.e. the tails of the output distribution). However, that would increase the number of redundant features of our input dataset which would make the machine learning algorithm harder to train.
The number of outputs for the b bp is 276 which is very high; very unusual for a machine learning setting. Each output corresponds to a depth so most of the variability was near the shallower regions for both the North Atlantic and Subtropical Gyres datasets as seen in Figure 3. This was verified through the mean and standard deviation of the outputs as it was heavily skewed towards first 100 depths. So a simple log transformation was used to increase the spread of the distribution to be more Gaussian-like. Regardless, we still have the problem of having a large number of outputs which is very difficult for a machine learning model to train with a modest number of data points. We considered doing a PCA transformation on the output depths to reduce the number of outputs, but instead we decided against it for the first pass as it adds a level of complexity.

Machine learning models
Multi-output regression methods are a challenge even in the machine learning community and there is no clear consensus about the best way to handle this problem in regression settings. Generally, there are two main approaches to this problem from a machine learning perspective: a single model for each output or a single model for all of the outputs (Xu et al., 2019). The ideal case is to have a single model to account for the correlated outputs and this makes intuitive sense because we know that the outputs are well correlated; for example the overall shape of the output (depths in our case) would be captured instead of looking at individual parts. In addition, this approach is especially powerful when you have missing data and would like to use semi-supervised learning (Álvarez et al., 2012). However, this approach can be more expensive, more difficult to train, and ultimately there are not very many machine learning models that are explicitly designed to handle multiple outputs. Some examples of ML models that can handle multi-output data include composition of functions like Neural Networks, Bayesian methods like Gaussian processes, and ensemble methods like Random Forests ( We also chose some of the simplest class of models like linear regression and random forests. Although we have plans to do a more extensive comparison between the approaches, this is outside the scope of this paper and henceforth when we refer to multi-output methods, we are assuming a single model that can handle multiple outputs. We considered some baseline and robust methods for this experiment. We looked at 2 classes of models: linear models and ensemble models. We chose these models because they are an excellent choice for a first pass on new datasets and they are more easily intepretable. These "weakly" parametric models are robust, generalizable and can fit a large number of different datasets. This allows us to avoid making too many non-testable assumptions on the pre-processing step. The next step is to start adding more physical and intuitive constraints and expectations, e.g. priors and uncertainty estimates. This is future work but will require a Bayesian perspective of things. The baseline linear model we used is a simple regularized linear regression (RLR) model; the ridge regressor. We hypothesized that this model would perform well on the STG dataset but not as well on the NA dataset due to the non-linearities present in the shallower regions. The ensemble method used is a Random Forest (RF) regressor. This is an ensemble algorithm which averages several independent decision tree estimators. The variance is reduced because of the averages and in general, it is robust to overfitting. Furthermore, we can extract the feature importance from the model to see which features had the greatest impact on the predictions within our model. The models were trained using 1,000 estimators and a mean squared error (MSE) criterion. Please see the github repository ML4OCEAN for some example notebooks highlighting the preprocessing routines and the machine learning models. For the experiment section, we will showcase the ridge linear regressor (RLR) and RF regressor results for the valdiation data. But we only show the RF model for the profiles as it was the better of the two models.

EXPERIMENTS
Two full machine learning pipelines were processed: one for the North Atlantic (NA) dataset and the other for the Subtropical Gyres (STG) dataset. All preprocessing steps that did not involve normalization were done prior to splitting the data into training, testing and validation. Note that before this splitting, the time series from two BGC-Argo floats (one in NA region and the other in the STG regions, more precisely in the South Atlantic Subtropical Gyre, identified by their official World Meteorological Organization number 6901486 and 3902121, respectively) were removed from the database to create an "independent data set" used for additional validation. After the split into training, testing and validation datasets, the normalization procedure was done on the training set only and then the transformation was done on the other two sets.  Figure 4 shows the R 2 and the RMSE for the North Atlantic ocean (NA) and oligotrophic Subtropical Gyres (STG). In the NA, the RLR and the RF algorithms performs well in the layer very near to the surface and beyond 200 m depth. The "middle" and upper layers (from ∼ 10 to 200 m) are where the weakest predictions are found. In contrast, RMSE values decrease with depth, with a maximum peak at ∼ 100 m. These weaker predictions are related to the depths where b bp values show the more variability ( Figure 3). Moreover, the minimum peak of accuracy at 400 m (low R 2 and high RMSE in Figure 4) is related to a higher variance in the b bp values contained in the training dataset at this specific depth. This variance is probably due to profile(s) with "spikes" (POC intense increase at depth) that can be directly linked to POC export due to sinking particles. These "extreme" profiles may have led to a decrease in the method performance for this specific layer. In the same way, the STG dataset shows an increase of the R 2 value first and then a relatively unstable decrease with depth, until reaching a peak of lower accuracy and higher RMSE at around 200 m depth. This peak is more accentuated with the RLR model. This depth is associated with the layer where b bp is more variable in STG regions because of the so-called deep biomass maxima that can be found in these oligotrophic areas (between ∼ 150 -200 m depth, Mignot et al., 2014). RMSE values in the STG areas show much more variability, related to the higher variance in the input data shown in Figure 3. However, the RMSE are lower for STG in a quantitative way (range from 0.1 to 0.15 for STG area compared to a range from 0.15 to 0.4 for NA area) because of the lower range of b bp values found in these oligotrophic gyres. The highest R 2 and lowest RMSE values very near the surface for both areas with both models can be explained by the fact that the methods should easily link surface b bp values to surface satellite reflectance (ρ) inputs. Besides, some ocean color algorithms retrieving biogeochemical parameters from ρ at several wavelength are based on machine learning algorithms.

Test Data
To better understand the validation results and relate them to the input features and response of the model, the feature ranking for the RF algorithm is shown in Figure 5 (for the entire output domain, not just a single layer). For the NA dataset, the most dominant features are the principal components of the density, temperature and salinity. Indeed, there is a lot of physical variability in this area, that can explain the POC vertical distribution. For example, a strong winter mixing can bring phytoplankton biomass and POC up to 1000 m depth. In addition, the location seems to be also very important which was expected due to the variability in this area. We used NA data from high latitudes to 0 • latitude (Figure 1) so this area is representative of very different trophic regimes (from high latitudes productive regions to the North Atlantic oligotrophic gyre). Surprisingly, less weight have the remote sensing reflectance, PAR and SLA variables, that are, those that affect more the upper layers. For the STG dataset, the location was the most important followed by some of the principal components for the temperature, salinity and density. The STG dataset is composed of 5 different gyres distributed on the planet whereas the NA dataset is much more spatially localized. Like the NA dataset, the reflectance and other variables like the MLD and PAR were not as important. The MLD has less seasonal variability in this warmer a) North Atlantic b) Subtropical Gyres area where waters are almost always stratified. In these areas, the information from PAR input may be supported by the DOY (day of the year, sinus and cosinus transformed) inputs. However, the SLA has a greater impact for the STG than the NA. This is due to the high impact of mesoscale and sub-mesoscale processes on the vertical distribution of phytoplankton biomass and POC in the oligotrophic areas where the surface waters are nutrient-depleted (e.g. Dufois et al., 2016).
One observation that can be made is that the "surface" variables (remote sensing reflectances, SLA, PAR) seem to have less weight in comparison to all of the variables. One explanation is that the features are shared for all of the outputs and therefore it would make sense that the upper layers are less accurate and/or the surface variables are less important as they only account for a fraction of the entire water column. Further steps can be taken to try to model each layer independently with independent features to verify that these variables are more prominent for the surface layers but you would lose the correlated outputs.

Validation Floats
The validation of the results has been made using two independent floats from the two separate areas ( Organization number 6901486 and 3902121 for NA and STG regions, respectively) by comparing the RF-retrieved b bp and the BGC-Argo measured b bp at each location in the 276 layers (i.e. 276 depths from the surface to 1,000 m depth). Figure 6 shows the typical measured b bp profiles that were used for the validation, very similar in shape and magnitude to the typical vertical profiles comprised in the training set. The scatter plot of measured vs. predictions for the NA dataset in Figure  7a shows a high R 2 of 0.86 for the RF algorithm and Figure 7b shows 0.73 for the RLR algorithm. The spread of the points appear to be for the lower depths (as for the test dataset in Figure 4), and an overestimation on the surface. In Figure8a, the STG displays a lower R 2 of 0.83 for the RF algorithm and Fig-ure8b shows a higher R 2 value of 0.86 for the RLR method.
The spread of the points on the 1:1 line is less compared with the NA; however, most of the points are situated above the line (constant overestimation) for the RF algorithm and below the line (constant underestimation) for the RLR algorithm. It is important to note that these validations show slight better or comparable statistics than the SOCA2016 independent validations for the NA and STG regions (R 2 = 0.81 and 0.85, slope = 0.81 and 0.85 and Mean Absolute Percent Difference, MAPD = 12% and 21% for SOCA2016 in NA and STG, respectively; see statistics for comparison with SOCA2020 in Figure 8). As SOCA2020 retrieves b bp with a greatly improved depth resolution, this present work shows very promising results. Figure 9 and Figure 10 show the comparison between in situ measured and RF-estimated b bp time series for the two validation floats over the water column (from the surface to 1,000 m depth). Results from the predictions are fairly smooth compared to the measured for the NA dataset ( Figure 9). Some of the details near the surface cannot be reproduced with high details in the predicted profile. Results from the predictions are also smooth compared to the in situ measured b bp for the STG dataset ( Figure 10)

CONCLUSIONS
Preprocessing techniques and machine learning model presented in this preliminary study give promising results, when using large datasets and attempting to predict a high number of output layers. The overall performance statistics are quite good and b bp vertical profiles present high similarities when compared with in situ observed data. This is still ongoing work, so we expect a) Random Forest Observations ( to see even better results (or at least more physically consistent) with algorithms and models adjusted to the several characteristics of the water column. Ultimately, using ML models to increase predictions of b bp profiles is a good endeavour and could be a viable option when coupled with more physical constraints and validation. Eventually, derived uncertainties will also be tackled, which will require another family of methods not yet tested.
The results from the new method (SOCA2020) will be available to users as part of the European Copernicus Marine Environment Monitoring Service (CMEMS). More specifically, the 4-dimensional products of particulate organic carbon (estimated from b bp using the method of Cetinic et al. (2012)) will be produced using merged hydrological and satellite (ocean color and altimetric) gridded-data available from CMEMS. The resolution of these products will be 0.25 • x0.25 • spatially, weekly temporally (from January 1998 to December 2018) and at 19 depth levels vertically from the surface to 1,000 m depth. These resolutions are defined from the lower input products resolutions (i.e. physical data). In addition, a multi-year monthly climatology will be provided. These CMEMS products will be first released within the year 2020 and then will be updated yearly. As all CMEMS products, these products will be qualified against totally independent in situ observations.
One of the future perspective of the present study is to develop the same method as SOCA2020 to retrieve chlorophyll a concentration, that is also a key biogeochemical product measured from profiling floats. The conjoint use of these two SOCA methods (that will retrieve b bp and chlorophyll a concentration) would offer a new path to examine the variability in the phytoplankton carbon to chlorophyll relationship over the vertical dimension, which would represent a great opportunity for a better understanding of light and nutrient control of phytoplankton biomass and physiological status at a global scale. This is a crucial step for improving the characterization of the distribution and variability in ocean primary production and carbon export.