GNSS-DERIVED PRECIPITABLE WATER VAPOR MODELING USING MACHINE LEARNING METHODS

Atmospheric water vapor plays a vital role in the global hydrological cycle and in climate change, and its spatio-temporal modeling and prediction help to identify and predict climatic phenomena. Accordingly, in this study, hourly precipitable water vapor (PWV) data sets from 27 Global Navigation Satellite Systems (GNSS) stations covering one month were used with machine learning methods to estimate PWV. Two machine learning methods were used: (1) Random Forest Regression (RFR) and (2) Extreme Gradient Boosting Regression (XGBR). The root mean square error (RMSE) of the PWV estimated with the RFR method (RFR PWV) is 2.42 mm and that of the PWV estimated with the XGBR method (XGBR PWV) is 2.75 mm; the R-squared (R²) is 0.74 for the RFR method and 0.71 for the XGBR method. These results indicate that both machine learning models can estimate PWV and capture the behavior and variations of precipitable water vapor over a small spatial and temporal interval. Although both methods performed well, the RFR model was slightly more accurate than the XGBR model.


INTRODUCTION
Water vapor is one of the most important and abundant greenhouse gases in the Earth's atmosphere and keeps the temperature of the Earth's surface above freezing. Atmospheric water vapor plays an influential role in global weather, climate change, and the hydrological cycle, and it is essential in many atmospheric phenomena such as floods and precipitation (Bevis et al., 1992; Philipona et al., 2005). It varies significantly across spatial and temporal scales. Accurate measurement of water vapor and of changes in its distribution has become one of the fundamental problems in synoptic meteorology, weather forecasting, and climate research. Knowledge of the rapid changes in water vapor is therefore essential for analyzing its global and regional distribution (Gendt et al., 2004; Ning et al., 2016; Wong et al., 2015).
Meteorologists use several parameters to express the amount of water vapor in the atmosphere; precipitable water vapor (PWV) is one of the most common. If all the water vapor in a vertical column of the atmosphere with unit cross-section were condensed, the depth of the resulting liquid water would be the precipitable water vapor. As the demand for accurate, real-time weather services has increased, traditional instruments such as radiosondes, water vapor radiometers, and solar photometers cannot continuously estimate water vapor with high accuracy and temporal resolution. Consequently, demand has grown for meteorological values with high spatial and temporal accuracy and resolution derived from the Global Positioning System (GPS) (Jin and Su, 2020; Kourtidis et al., 2015).
Bevis et al. (1992) first introduced GPS meteorology, in which atmospheric water vapor is estimated from the observations of ground-based GPS receivers. Researchers regard this method as a powerful tool for PWV estimation because it is usable in different weather conditions, provides continuous observations with very high temporal resolution, is low cost, and yields PWV with an accuracy of about 1-3 mm relative to radiosondes (Bevis, 1994; Foster et al., 2000; Niell et al., 2001; Ning et al., 2016; Van Baelen et al., 2005; Vey et al., 2009; Zhao et al., 2020). Rocken et al. (1993) then implemented the method of Bevis et al. using two GPS receivers located 50 km apart and compared the results with a water vapor radiometer (WVR) station; the difference in water vapor between the two methods was about 1 mm. In 1997, Elgered et al. investigated and modeled air mass movement using four years of GPS network observations; the results showed very good agreement between GPS-derived PWV and the radiosonde and WVR values. From 1997 to 2001, Rocken et al., Emardson et al., and Niell et al. conducted many studies on PWV estimation using GPS, and the satisfactory results of these studies demonstrated the effectiveness of GPS networks in meteorological studies (Niell et al., 2001; Rocken et al., 1997). Gradinarsky et al. (2002) compared meteorological satellite, radiosonde, and GPS data to observe the seasonal behavior of PWV and found significant trends using seven years of data. Grubbs and Jain (2017) used nine years of data in Sweden to examine trends in PWV obtained from radiometers, radiosondes, and GPS. Other researchers have conducted similar studies (Barman et al., 2017; Duan et al., 1996; Gradinarsky et al., 2002; Jin et al., 2009; Vey et al., 2009; Wagner et al., 2006).
In recent years, machine learning (ML) methods for estimating an environmental or physical parameter from its relationship with other factors have made significant progress. ML techniques are good alternatives for analyzing complex biological systems (Kasampalis et al., 2018; Seyed Mousavi and Akhoondzadeh Hanzaei, 2022). ML methods are expected to perform well in PWV estimation, but different techniques and different modeling scales can yield different performance. In this article, two ML methods, Random Forest Regression (RFR) and Extreme Gradient Boosting Regression (XGBR), are used to estimate PWV in a region of the southwestern United States. Random forest is an ensemble learning method that can be used for regression or classification. The XGBR model was developed by Chen and Guestrin and is an advanced and popular ML algorithm. This study is organized as follows. Section 2 presents the study area and the data used, together with the estimation of PWV from GNSS data. Section 3 explains the RFR and XGBR methods and applies them to the GNSS data. Section 4 presents the statistical analyses and the comparison of the models. Finally, Section 5 gives the conclusion.

Study Area
The Plate Boundary Observatory (PBO) network stations were launched in 2008 for 3D strain monitoring in North America and Alaska, and the network has since been expanded. Some stations in this network record observations at a high rate (1 and 2 seconds). Stations of this network are used in this study. The study area lies between longitudes -118.6° and -117.6° and latitudes 34.4° and 35.4°. Figure 1 shows the distribution of the stations used in this article.

ECMWF ERA5 data
The fifth generation of atmospheric reanalysis (ERA5) from the European Centre for Medium-Range Weather Forecasts (ECMWF) provides data from 1979 to the present. This database provides parameters such as temperature (T), pressure (P), and other meteorological variables globally at grid points with a horizontal resolution of 0.25° × 0.25°. Because of their high spatial and temporal resolution and global coverage, the reanalysis products of ECMWF have been used in various fields, including GNSS meteorology. The temporal resolution differs between reanalysis products; ERA5 provides meteorological data and parameters at a temporal resolution of one hour and therefore has excellent potential for retrieving PWV with high temporal resolution. In this study, surface pressure and temperature from the ERA5 data series are used.

GNSS PWV data
When GNSS signals pass through the troposphere to reach ground receivers, they are delayed. This delay can be converted from the slant direction to the zenith direction using mapping functions; the delay in the zenith direction is called the zenith tropospheric delay (ZTD). ZTD can be divided into two components, the zenith dry delay (ZHD) and the zenith wet delay (ZWD). The dry delay is a function of pressure and temperature at the Earth's surface and can be calculated from meteorological parameters measured at the surface with an accuracy of a few millimeters (Bevis et al., 1992). In this study, ZHD is calculated using Eq. (1) (Saastamoinen, 1973).
$$ZHD = \frac{0.002277\,P_s}{1 - 0.00266\cos(2\varphi) - 0.00000028\,H} \tag{1}$$
where $P_s$ is the surface pressure in millibars, $\varphi$ is the latitude, and $H$ is the orthometric height in meters. Because water vapor changes strongly with time, the total delay cannot be modeled with high accuracy.
ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume X-4/W1-2022. GeoSpatial Conference 2022 - Joint 6th SMPR and 4th GIResearch Conferences, 19-22 February 2023, Tehran, Iran (virtual). This contribution has been peer-reviewed. The double-blind peer-review was conducted on the basis of the full paper.
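As an illustration of Eq. (1), the following minimal Python sketch evaluates the Saastamoinen-type dry delay. It assumes surface pressure in millibars (hPa), latitude in degrees, height in meters, and a result in meters; the function name and the sample station values are hypothetical and are not taken from the paper.

```python
import math

def saastamoinen_zhd(pressure_hpa: float, lat_deg: float, height_m: float) -> float:
    """Zenith dry (hydrostatic) delay in meters from surface pressure.

    Assumes pressure in hPa (millibars), latitude in degrees and
    orthometric height in meters, following the form of Eq. (1).
    """
    phi = math.radians(lat_deg)
    # Correction for the variation of gravity with latitude and height.
    gravity_term = 1.0 - 0.00266 * math.cos(2.0 * phi) - 0.00000028 * height_m
    return 0.002277 * pressure_hpa / gravity_term

# Hypothetical station inside the study area (values are illustrative only).
zhd = saastamoinen_zhd(pressure_hpa=930.0, lat_deg=34.9, height_m=700.0)
print(f"ZHD = {zhd:.3f} m")
```

In practice the pressure would come from the ERA5 surface pressure interpolated to the station, as described above.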
The total tropospheric delay in the zenith direction can be estimated using precise GNSS data processing software such as Bernese (Dach et al., 2007). Subtracting the dry delay from the total delay according to Eq. (2) gives the wet delay in the zenith direction.
$$ZWD = ZTD - ZHD \tag{2}$$
PWV is then obtained from ZWD using Eq. (3).
$$PWV = \Pi \cdot ZWD \tag{3}$$
In Eq. (3), $\Pi$ is the conversion factor, a unitless quantity that is calculated using Eq. (4).
$$\Pi = \frac{10^{6}}{\rho_w R_v \left( \dfrac{k_3}{T_m} + k_2' \right)} \tag{4}$$
where $R_v$ is the specific gas constant for water vapor (approximately 461.5 J kg$^{-1}$ K$^{-1}$), $\rho_w$ is the density of liquid water, and $k_2'$ and $k_3$ are atmospheric refractivity constants. $T_m$ is the weighted mean temperature of the atmosphere, which is estimated from the temperature and water vapor pressure of the region (Davis and Herrinch, 1985).
$$T_m = \frac{\int (e/T)\,dz}{\int (e/T^{2})\,dz} \tag{5}$$
where $e$ is the water vapor pressure in millibars and $T$ is the temperature in kelvin. Empirically, the conversion factor $\Pi$ is about 0.15; its actual value varies between 0.12 and 0.18 depending on the latitude, season, and climate of the study area. In this article, the empirical $T_m$ model presented for the United States, Eq. (6), is used.
$$T_m = 70.2 + 0.72\,T_0 \tag{6}$$
In Eq. (6), the surface temperature $T_0$ is in kelvin.
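The chain from ZTD to PWV described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the commonly cited form of the conversion factor and a linear surface-temperature model for $T_m$; the refractivity constants, the $T_m$ coefficients, and all sample values are assumptions drawn from the general GPS meteorology literature, not values confirmed by this paper.

```python
RHO_W = 1000.0    # density of liquid water, kg m^-3
R_V = 461.5       # specific gas constant of water vapor, J kg^-1 K^-1
K2_PRIME = 0.221  # refractivity constant, K Pa^-1 (assumed: 22.1 K hPa^-1)
K3 = 3.739e3      # refractivity constant, K^2 Pa^-1 (assumed: 3.739e5 K^2 hPa^-1)

def mean_temperature(t0_kelvin: float) -> float:
    """Weighted mean temperature from surface temperature (assumed linear model)."""
    return 70.2 + 0.72 * t0_kelvin

def conversion_factor(tm_kelvin: float) -> float:
    """Unitless factor that maps zenith wet delay to PWV."""
    return 1.0e6 / (RHO_W * R_V * (K3 / tm_kelvin + K2_PRIME))

def pwv_mm(ztd_m: float, zhd_m: float, t0_kelvin: float) -> float:
    """PWV in millimeters from total and dry zenith delays given in meters."""
    zwd_m = ztd_m - zhd_m                      # wet delay = total - dry
    factor = conversion_factor(mean_temperature(t0_kelvin))
    return factor * zwd_m * 1000.0             # meters to millimeters

# Hypothetical epoch: ZTD from GNSS processing, ZHD from surface pressure.
pi_factor = conversion_factor(mean_temperature(290.0))
pwv = pwv_mm(ztd_m=2.22, zhd_m=2.12, t0_kelvin=290.0)
print(f"conversion factor = {pi_factor:.3f}, PWV = {pwv:.1f} mm")
```

With these assumed constants the factor falls inside the 0.12 to 0.18 range quoted in the text, which is a useful sanity check on the units.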

METHODS
As shown in Figure 2, we first estimated PWV for the study period from the GNSS observations. We then trained the machine learning models using the GNSS-derived PWV together with the meteorological parameters extracted from the ERA5 data. We used 80% of the data to train the RFR and XGBR models and a randomly selected 20% to test and evaluate the resulting PWV estimates. Randomly withholding 20% of the data creates gaps in the time series, and the trained models were used to estimate PWV at the epochs where these gaps occur. The machine learning methods are explained below.
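The random 80/20 split, and the way it leaves gaps in an hourly series, can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name, the seed, and the assumption of one sample per hour over 32 days are hypothetical choices made for the example.

```python
import random

def split_time_series(n_samples: int, test_fraction: float = 0.2, seed: int = 42):
    """Randomly withhold a fraction of epochs for testing.

    Returns two sorted index lists; the test indices are the gaps that
    the trained model is later asked to fill.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(n_samples * test_fraction)
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx

# Assumed layout: one PWV value per hour for days 86 to 117 (32 days).
n_hours = 32 * 24
train_idx, test_idx = split_time_series(n_hours)
print(len(train_idx), "training epochs,", len(test_idx), "withheld epochs")
```

Because the withheld epochs are scattered through the month, each gap is usually surrounded by training epochs, so this setup tests interpolation within the series rather than forecasting beyond it.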

Random Forest Regression
RFR is a non-parametric, supervised, tree-based machine learning algorithm in which many decision trees are trained on random samples of the training data (Shah, Angel et al. 2019). For regression problems, the forest combines a large set of regression trees, and the prediction of each tree is treated as a vote (Zarei et al., 2021; Wang, Zhou et al. 2016).
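To make the idea concrete, the sketch below builds a very small random forest in plain Python. It is a deliberate simplification and not the configuration used in this study: each tree is a depth-one regression tree (a stump) fitted on a bootstrap sample with one randomly chosen feature, and the forest prediction is the average of the tree predictions. The data are synthetic and all names are hypothetical; in practice a library implementation such as scikit-learn's `RandomForestRegressor` would normally be used.

```python
import random
from statistics import fmean

def fit_stump(X, y, feature):
    """Least-squares single split on one feature: (feature, thr, left, right)."""
    best = None
    values = sorted(set(row[feature] for row in X))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [t for row, t in zip(X, y) if row[feature] <= thr]
        right = [t for row, t in zip(X, y) if row[feature] > thr]
        ml, mr = fmean(left), fmean(right)
        sse = sum((t - ml) ** 2 for t in left) + sum((t - mr) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, thr, ml, mr)
    if best is None:  # feature is constant in this bootstrap sample
        m = fmean(y)
        return (feature, float("inf"), m, m)
    return (feature, best[1], best[2], best[3])

def fit_forest(X, y, n_trees=30, seed=0):
    """Bagging plus random feature choice, one stump per tree."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        feature = rng.randrange(n_features)         # random feature for this tree
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest

def predict(forest, row):
    """Average the predictions (votes) of all trees."""
    return fmean(ml if row[f] <= thr else mr for f, thr, ml, mr in forest)

# Synthetic example: features are [pressure in hPa, temperature in K].
data_rng = random.Random(1)
X = [[data_rng.uniform(900, 950), data_rng.uniform(275, 305)] for _ in range(120)]
y = [0.5 * (t - 275.0) + data_rng.gauss(0.0, 1.0) for _, t in X]
forest = fit_forest(X, y)
print(predict(forest, [925.0, 280.0]), predict(forest, [925.0, 300.0]))
```

Averaging over many trees trained on different bootstrap samples is what reduces the variance of the individual trees.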

XGBoost Regression
Extreme Gradient Boosting (XGBoost) is a machine learning regression method developed by Tianqi Chen and based on the gradient boosting algorithm. It uses residuals to improve the model: XGBoost combines weak regressors into a strong regressor by iteratively producing new trees that fit the residuals of the previous trees (Jing, Zou et al. 2022). In addition, compared with earlier algorithms, it supports parallel computing and, by using a regularized model, it is better protected against overfitting (Zamani Joharestani, Cao et al. 2019). The XGBoost algorithm can perform regression and classification tasks in many applications, including remote sensing. Its growing popularity is due to its high accuracy and stability relative to other algorithms (Arjasakusuma, Swahyu Kusuma et al. 2020).
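The residual-fitting idea can be illustrated with a plain gradient boosting loop for squared error. The sketch below is not XGBoost itself: it omits the regularization, second-order information, and parallel tree construction that distinguish XGBoost, and it uses depth-one trees on a single feature. The data, function names, and hyperparameters are hypothetical.

```python
import math
from statistics import fmean

def fit_stump(x, r):
    """Least-squares single split of targets r on feature x: (thr, left, right)."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [ri for xi, ri in zip(x, r) if xi <= thr]
        right = [ri for xi, ri in zip(x, r) if xi > thr]
        ml, mr = fmean(left), fmean(right)
        sse = sum((ri - ml) ** 2 for ri in left) + sum((ri - mr) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, thr, ml, mr)
    return best[1:]

def fit_boosting(x, y, n_rounds=100, learning_rate=0.1):
    """Each round fits a stump to the residuals of the current model."""
    base = fmean(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        thr, ml, mr = fit_stump(x, residuals)
        stumps.append((thr, ml, mr))
        pred = [pi + learning_rate * (ml if xi <= thr else mr)
                for xi, pi in zip(x, pred)]
    return base, learning_rate, stumps

def predict(model, xi):
    """Sum the shrunken contributions of all stumps on top of the base value."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(ml if xi <= thr else mr for thr, ml, mr in stumps)

# Synthetic example: a smooth daily cycle sampled every half hour.
x = [i / 2.0 for i in range(48)]
y = [15.0 + 5.0 * math.sin(2.0 * math.pi * xi / 24.0) for xi in x]
model = fit_boosting(x, y)
y_mean = fmean(y)
mse_mean = fmean((yi - y_mean) ** 2 for yi in y)
mse_boost = fmean((yi - predict(model, xi)) ** 2 for xi, yi in zip(x, y))
print(f"MSE of mean predictor = {mse_mean:.2f}, MSE after boosting = {mse_boost:.2f}")
```

The learning rate shrinks each tree's contribution, so many small corrections accumulate instead of a few large ones; this is one of the simplest guards against overfitting in boosting.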

RESULTS
The RFR and XGBR models were evaluated using the observations of 27 GNSS stations in the southwestern United States, covering days 86 to 117 of 2021. Three stations were randomly selected to analyze the PWV estimates of the RFR and XGBR models. Table 1 lists R² and RMSE for these stations.
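For reference, the two statistics reported in Table 1 can be computed as in the following sketch. It assumes that R² is the coefficient of determination (one minus the ratio of residual to total sum of squares); the paper does not state which definition was used, and the sample values are invented for illustration.

```python
import math

def rmse(observed, estimated):
    """Root mean square error between two equally long sequences."""
    n = len(observed)
    return math.sqrt(sum((o - e) ** 2 for o, e in zip(observed, estimated)) / n)

def r_squared(observed, estimated):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Invented values: GNSS PWV (mm) and model PWV (mm) at four withheld epochs.
gnss_pwv = [10.0, 12.0, 15.0, 11.0]
model_pwv = [11.0, 12.0, 14.0, 12.0]
print(f"RMSE = {rmse(gnss_pwv, model_pwv):.3f} mm")
print(f"R2   = {r_squared(gnss_pwv, model_pwv):.3f}")
```

RMSE is expressed in the units of PWV (mm), whereas R² is unitless, which is why the two are reported side by side.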

Comparison of PWV obtained from GNSS and RFR model
After the training stage, precipitable water vapor can be estimated for days 86 to 117 at each station. In addition, hourly GNSS PWV values were estimated for the study period at all stations. In the following, the time series of GNSS PWV at different stations are compared with the corresponding values obtained from the RFR model. Figure 3 shows the time series of PWV obtained from GNSS and from the RFR model for the selected stations. The R² values for the selected stations range from 0.71 to 0.75, and the RMSE ranges from 1.98 to 2.95 mm. As seen in Figure 4, in the gaps created in the time series, the PWV estimated by the RFR model was close to the GNSS PWV values, with a high R².

Comparison of PWV obtained from GNSS and XGBR model
Figure 5 shows the time series of PWV estimated using the XGBR model for the study period, together with the time series of PWV obtained from GNSS. As shown in Figure 5, the PWV values estimated by the XGBR model are close to the GNSS-derived values. According to Table 1, R² for the XGBR model ranges from 0.70 to 0.73, and its RMSE ranges from 2.23 to 3.49 mm. Figure 6 shows the R² between the values estimated by the XGBR model and the values obtained from GNSS.

Figure 1. Study area and distribution of GNSS stations.

Figure 2. Diagram of different steps of training models.

Figure 6. RMSE and R² of the PWV differences for three GNSS stations in the XGBR model.

Table 1. Statistics of PWV estimates for different models and stations.