ESTIMATING AIRBORNE PARTICULATE MATTER IN THE NATIONAL CAPITAL REGION, PHILIPPINES USING MULTIPLE LINEAR REGRESSION AND GRADIENT BOOSTING ALGORITHM ON MODIS MAIAC AEROSOL OPTICAL DEPTH

: The generation of air quality concentration data is imperative for the health and environment of highly urbanized regions. Through remote sensing, air pollutant concentrations can be obtained over large areas for a long time. In this study, particulate matter (PM 2.5 and PM 10 ) concentrations were estimated using satellite-derived Moderate Resolution Imaging Spectroradiometer (MODIS) Multi-Angle Implementation of Atmospheric Correction (MAIAC) Aerosol Optical Depth (AOD) values observed in the National Capital Region (NCR), Philippines. Models were generated using multiple linear regression (MLR) and gradient boosting regression to determine the best models for the whole data from 2017 to 2020, dry season, and wet season with a 70-30 split for the train-test sets. Initial models resulted with the best coefficient of determination R 2 values of 2.6% and 1.2% using MLR and 2.0% using gradient boosting regression. The results for PM 2.5 and PM 10 showed the lowest Root Mean Square Error (RMSE) values of 8.79 μg/m 3 and 18.99 μg/m 3 using MLR and 8.08 μg/m 3 and 16.85 μg/m 3 using gradient boosting, respectively. The preliminary results indicate the relatively poor performance of models in estimating particulate matter using satellite-derived AOD images.


INTRODUCTION
Air quality monitoring in the Philippines is conducted through ground monitoring station data regulated by the Department of Environment and Natural Resources (DENR).Through air quality monitoring, air pollutant concentration data are gathered to assess if concentration levels are good, unhealthy for sensitive groups, or at emergency levels.However, this assessment solely depends on the position of the monitoring stations placed by the regulatory department; therefore, limiting the scale of assessment based on the location of the air quality monitoring stations (Krunpick et al., 2003;Aniceto et al., 2021).Moreover, the generation of regulatory-grade air quality monitoring stations is costly and would take a huge amount of time for planning and budgeting before being operational.Additionally, there are times when insufficient data were recorded for specific stations.In this sense, air quality sensors were developed by researchers and groups to generate more data, however, the coverage is still limited depending on the target area of the respective owners (Galarpe, 2017;Cruz et al., 2019).On the other hand, air quality monitoring is also conducted nowadays by utilizing geospatial and remote sensing techniques to gather air quality data over large areas.Air quality data are either directly obtained from sensors in various satellite platforms or derived from satellite images using regression analysis or dispersion models.Engel-Cox et al. (2012) provided recommendations on the use of satellite remote sensing data for urban air quality assessment.The researchers showed the integration of ground-based and satellite data for air quality monitoring and the difficulties encountered in terms of collaboration, access, resources, and scope of analysis.Landsat, Multi-angle Imaging Spectroradiometer (MISR), Moderate Resolution Imaging Spectroradiometer (MODIS), TOMS, SeaWiFS, and SPOT were a few of the satellite systems discussed that are ready to be used for air quality applications.Duncan et al. (2014) of NASA Goddard Space Flight Center reviewed and discussed the use of satellite data for air quality applications.Satellite data was determined to be useful in tracking pollutant plumes, air quality forecasting, exceptional events demonstration, and data for air quality models.Li et al. (2020) proposed the integration of estimated particulate matter (PM) concentrations from satellite Aerosol Optical Depth (AOD) products and ground-based data from a network of lowcost PM sensors.Specifically, 75 monitoring stations and 2,363 AirBox low-cost sensor data together with MODIS Terra remote sensing data were used to generate surface PM estimates.Moreover, Allen et al. (1997), Chung et al. (2001), Hauck et al. (2004), andTakahashi et al. (2008) determined that traditional methods for PM measuring such as the gravimetric method, β-ray attenuation monitoring, or the use of tapered element oscillating microbalance (TEOM) were costly and require regular maintenance.Therefore, other methods of measuring PM measurements were introduced, especially low- cost PM sensors and satellite-derived PM concentrations.Satellite-derived AOD is used in several studies for the estimation of ground-based particulate matter with particles of less than 2.5 µm in diameter (PM2.5).Various models such as linear regression, multiple linear regression (MLR), artificial neural networks, and geographically weighted regression were tested out by studies as a means of improving the relationship between AOD and PM2.5.One of the fundamental requirements in conducting the said models is the availability of satellite data.Gogikar, et al. (2020) conducted a study for the estimation of PM2.5 values from satellite-derived AOD using different kinds of regression models, namely, simple linear regression, MLR, log-linear regression, and conditional-based MLR.The study area is in Agra and Rourkela region, India with inclusive years ranging from 2009 to 2015.It was found that the coefficient of determination (R) was statistically significant from the use of Model II or MLR.Moreover, Othman et al. (2010) generated PM10 estimates using Landsat 7 ETM+ satellite images and insitu measurements from the DustTrak aerosol monitor 8520.Multiple linear regression analysis was used to generate the PM10 estimation model from the RGB bands of Landsat 7 ETM+.The coefficient of determination resulted in 0.888 and validation with in-situ data gave an accuracy greater than 0.8 for the R coefficient.Machine learning algorithms, specifically gradient boosting, were also used by several researchers in estimating air quality concentration from satellite-derived images.Gradient boosting is a machine learning algorithm used to produce prediction models from an ensemble of weak predictive models for regression and classification.Gradient boosting optimizes a loss function depending on the type of problem needed to be solved, with squared error for regression as an example.The algorithm uses weak learners in the form of decision trees to make predictions.The output of the final model is improved and corrected by adding the output of the new tree to the existing sequence of trees until the loss reaches an acceptable level where no improvement can be observed (Natekin and Knoll, 2013).Gradient boosting is similar to random forests in terms of combining decision trees in the algorithm, however, random forests combine them at the end while gradient boosting combines the trees along the way In general, gradient boosting results in better performance than random forests if parameters were tuned.It would even work with unbalanced data, and reduce the chances of overfitting (Cai et al., 2020).Extreme Gradient Boosting (XGBoost) is an optimized gradient boosting algorithm that applies level-wise tree growth designed to be highly efficient, flexible, and portable.Chen et al. (2019) improved the estimation of ground PM2.5 estimation using satellite-derived AOD and extreme gradient boosting (XGBoost) to reduce limits and biases in estimating PM2.5.Additionally, a two-step method was used to interpolate missing values in AOD, reducing the missing value rate of daily AOD data to 13.83% from 87.91%.Using XGBoost regression with a non-linear exposure-lag-response model (NELRM) resulted in a cross-validation coefficient of determination of 0.86, a Root Mean Square Error (RMSE) of 14.98, and a Mean Absolute Percentage Error (MAPE) of 23.72%.Another study by Fan et al. (2020) introduced a development in estimating PM2.5 using satellite-derived AOD using spatially local extreme gradient boosting (SL-XGB) to obtain accurate results in unsampled spatial areas while also filling gaps in satellite-derived AOD.This study shows the initial models generated for PM2.5 and PM10 estimation for seasonal and yearly models using MODIS MAIAC AOD and ground monitoring station PM as training-test data for the regression analysis.

METHODOLOGY
The study is divided into three major components as shown in Figure 1: (1) Gathering of input parameters, (2) pre-processing of satellite images and monitoring station data, and (3) regression modeling.Figure 2 shows the National Capital Region, a metropolitan area composed of four districts divided into 16 cities and 1 municipality with a total land area of 619.57km 2 , which serves as the seat of the government, population, and economy of the Philippines (Chua, et al., 2021).The region is affected by two seasons based on temperature and rainfall, specifically the wet season from June to November and the dry season from December to May.Tomacruz (2018) determined that the Southeast Asian region averages 21 μg/m 3 annually in PM2.5 concentrations which is over twice the recommended value.With the region being the center of population and economy where 2.5 million annual average daily traffic was recorded, several risk factors that may cause hazardous effects are present in the region due to significant air pollution, especially in urban areas (Villas-Alvaren, 2016).Specifically, in terms of PM2.5, the World Health Organization (WHO) discovered the region's average concentration of 30.44 μg/m 3 from 2016 to 2018 far exceeded the annual PM2.5 standard recommended value of 20 μg/m 3 .Particularly, the WHO report for 2016 showed that the Philippines' PM2.5 concentration value of 18.4 μg/m 3 is approximately 80% higher than the indicated safe levels, with Metro Manila registering a value of 55 μg/m 3 (Ambag, 2019).These high levels of ambient PM2.5 cause prolonged outdoor exposure to be hazardous, especially for those with occupations that are required to do so (Estoque, 2020).

Input Parameters for Regression Analysis
In this study, MODIS MAIAC satellite-derived AOD was used as the independent parameter for the regression analysis.MODIS is an instrument carried by NASA's Terra and Aqua satellites launched on December 18, 1999, and May 4, 2002, respectively.One of the algorithms used to retrieve AOD from the MODIS spectral reflectance is the Multi-angle Implementation of Atmospheric Correction (MAIAC). .Two bands are available for analysis namely: Optical_Depth_47 (AODB) and Optical Depth 55 (AODG).MAIAC AOD data (MCD19A2.006)which reports daily AOD in 1-km spatial resolution were directly downloaded using Google Earth Engine from 2017 to 2020.Datasets were clipped using the regional boundary shapefile and images with very high cloud cover were removed.In-situ PM data that were used as predicted variable data were obtained from the 15 ground monitoring stations of DENR placed all over NCR as shown in Figure 3. From these ground monitoring stations, PM observations every hour were extracted from 2017 to 2020.Hourly in-situ PM data were then aggregated to daily averages to match the temporal frequency of the satellite-derived AOD images.

Pre-processing of satellite images and monitoring station data
In-situ PM data were converted to point shapefiles to combine them with the respective AOD concentration data from the satellite images.Concentration values were extracted to commaseparated values files using point sampling.Obtained data from point sampling were combined into one datasheet containing all the data from 2017 to 2020 (yearly), all data for the dry season, and all data for the wet season.Datasets were then converted to a two-dimensional data structure composed of rows and columns called dataframes using a script to prepare them as machine-ready datasets for regression modeling.Negative values and null data values, including random string values, were removed from the data frame.Moreover, outliers were removed through outlier detection techniques.Lastly, data were split into train-test data with 70-30 split percentages.

Regression Analysis
Models were generated using MLR and Extreme Gradient Boosting (XGBoost).MLR is a regression model that involves two or more independent variables in estimating its relationship with a dependent variable using a straight line (Tranmer et al., 2020).On the other hand, XGBoost is an optimized gradient boosting algorithm that applies level-wise tree growth designed to be highly efficient, flexible, and portable (Wade, 2020).Hyperparameters were tuned using Randomized Search Cross Validation with 10 k-folds, iterating different combinations of hyperparameters to determine the best models, yearly and each season, to avoid overfitting and optimize processing time.The best-performing model was then saved using Joblib.Generated model was then tested using the test data, and the feature importance was determined to evaluate which features affected the models the most.

Data Distribution Plots
Data distribution was observed by checking the histogram and box plots of the input parameters for the PM modeling shown in Figure 4, Figure 5, Figure 6, and Figure 7, respectively.PM2.5 values range from 0 to 50 µg/m 3 with the first quantile value of 15 and 3 rd quantile value of 30, while 0 to 110 µg/m 3 with the first quantile value of 38 and 3 rd quantile value of 62 for PM10.On the other hand, AODB values range from 0 to 500 while AODB ranges from 0 to 350.These observations imply that the values concentrate more in areas below the average value of the respective input parameters.

Correlation Analysis
Correlation between variables was first determined before regression modeling.Correlation analysis was performed to determine the strength of the relationship between the defined variables.However, it is necessary to remember that correlation analysis does not define causation between the variables.A positive correlation only denotes a direct relationship between the variables.If there is an increase/decrease in one variable, an increase/decrease can also be observed in the other variable.On the other hand, a negative correlation signifies an increase in one variable while the other variable decreases, and vice versa.Table 1 shows the correlation coefficient between AOD and PM.PM2.5 in general shows a greater correlation to AODB and AODG than PM10 to the independent variables.Models containing all the data from 2017 to 2020 (Yearly model) resulted in the highest correlation to PM2.5 while the dry season model showed the highest correlation between the variables for PM10.Correlation coefficients for all the models between AODB and AODG were very similar and almost equal.This indicates that AODB and AODG were highly correlated with each other and might be showing the same observations over the target areas.The similarity between AODB and AODG might be due to their nature of being AOD with the only difference being one was derived from 0.47 μm and 0.55 μm, respectively, having minute differences for most areas over land.With this result, future models can remove one of the said parameters for a more optimal modeling process and avoidance of multicollinearity between variables.

Multiple Linear Regression Analysis
After correlation analysis, models were generated using multiple linear regression both for all of 2017 to 2020 and dry and wet seasons.The intercept value in a regression model represents the mean value of PM when the AOD is zero, though, a rare occasion for this parameter.On the other hand, the coefficient values for AOD showed negative values for AODB across all the models while positive values for AODG.This denotes that an increase in AODB causes a decrease in the mean PM and vice versa, while an increase/decrease in AODG causes an increase/decrease in the mean PM.Moreover, AODG resulted in higher magnitudes than AODB, implying that AODG affects the values of the PM more than AODB in the generated models.Model parameters including error measurements were also checked to determine the performance of the models.Specifically for RMSE, the resulting models showed relatively low error values with the dependent variable, PM, with minimum and maximum values of 0 to 50 for PM2.5 and 0 to 100 for PM10.The coefficient of determination (R 2 ) for almost all models resulted in positive values significant from zero with the highest percentage of around 2.6%.PM2.5 models showed a better fit than PM10 to AOD, however, underfitting was present for both models of PM2.5 and PM10.This might imply that the variability of PM was not fully taken into account by the independent variables, AOD.Therefore, for the improvement of the model, other variables were added to the modeling process such as meteorological parameters.Moreover, the number of observations used to model PM might not be enough to properly calculate its variability as also seen later on with the gradient boosting models.
Figure 8 and Figure 9 display the scatter plot of in-situ PM vs predicted PM from MLR. Scatter plots showed the concentration of values below the average values which is more predominant in the PM10 models.The clustering of values over specific predicted values or an unbalance in the y-axis can also be seen as a result of the underfitting of the models.
Figure 10 and Figure 11 show the residual distribution plots of the best models for PM2.5 and PM10.Plots showed near-normal distribution curves with a slight skew to the right.Specifically, plots showed residuals having a higher density in the negative area from 0 to -10.This could mean that the model is biased for lower values resulting in a lower fit across all the models.In a similar way of checking the influence of the independent variables in affecting the dependent variable values through the coefficient values, feature importance plots were generated to determine which parameter between AODB and AODG affected the model building the most using gradient boosting regression.Feature importance computes F-scores that represent the importance of each input feature for a given model.The larger the F-score means the higher the effect of that specific input feature in building the model in predicting a variable or PM in this case.

Gradient Boosting Regression
Figure 12 shows the F-scores of each input feature for all the generated models using gradient boosting regression.Overall, AODG resulted in higher F-score values than AODB, even though the difference between these input features was low.
From these results and the ones from MLR, dropping AODB for future modeling processes can be considered for a more optimal modeling process and smaller chances for multicollinearity.Falk and Miller (1992) recommended coefficient of determination values greater than 0.10 to adequately determine the variability of the endogenous variable.On the other hand, Cohen (1988) recommended endogenous variables with R 2 values of 0.26 to be assessed as substantial, while 0.13 for moderate, and 0.02 for weak.
Some of the models resulted in a weak correlation coefficient significant from zero.This implies that the input variables might not be sufficient to properly take into account the variability of PM, however, this does not mean that the other resulting variables from the correlation analysis and feature importance plots are insignificant.On the other hand, it is imperative that to improve the models, additional input parameters or more in-situ monitoring station data are needed to gather more significant results.

PM2.5 and PM10 Maps
Maps were generated using a python script to apply the saved best models with JobLib to the satellite-derived AOD images shown in Figure 13, Figure 14, Figure 15, and Figure 16, respectively.

CONCLUSION AND RECOMMENDATIONS
Regression models were generated for the estimation of PM2.5 and PM10 using MODIS MAIAC AOD satellite images.Models were generated by integrating in-situ monitoring station data with satellite-derived data using MLR and XGBoost.
Results showed low coefficient of determination values and significantly low RMSE values.Cases of underfitting for the models might be the result of the insufficient number of data points, specifically for machine learning algorithms such as gradient boosting regression due to its integrated crossvalidation process when generating the model.This results in models with a coefficient of determination value less than zero.However, PM and AOD showed a correlation, especially for PM2.5 across all the models.Variability and fit may be improved when AOD is combined with other input parameters in estimating PM.
PM2.5 and PM10 concentration estimation maps were generated by applying the best models to the MODIS MAIAC AOD satellite images.
Undergoing improvement of the models is the addition of several input parameters such as meteorological parameters.Specifically, parameters such as wind speed and direction, mean temperature, minimum temperature, temperature, pressure, precipitation, land use/land cover, and other geospatial layers are added to encapsulate the variability of PM2.5 and PM10.Additional years of analysis might also be helpful if air quality monitoring station data are available.

Figure 1 .
Figure 1.PM modelling workflow using MLR and gradient boosting algorithm.

Figure 2 .
Figure 2. The National Capital Region of the Philippines.

Figure 3 .
Figure 3. Air quality monitoring stations in the NCR.

Figure 5 .
Figure 5. Box plots of the input parameters for PM2.5 models.

Figure 6 .
Figure 6.Histogram of the input parameters for PM10 models.

Figure 7 .
Figure 7. Box plots of the input parameters for PM10 models.

Figure 9 .
Figure 9. In-situ PM10 vs. Predicted PM10 plots for the yearly model (top left), dry model (top right) and wet model (bottom).

Figure 10 .
Figure 10.In-situ PM2.5 residual plots for the yearly model (top left), dry model (top right) and wet model (bottom).

Figure 11 .
Figure 11.In-situ PM10 residual plots for the yearly model (top left), dry model (top right) and wet model (bottom).

Figure 12 .
Figure 12.Feature importance of the generated gradient boosting models for PM estimation.

Figure 13 .
Figure 13.Sample PM2.5 estimation maps using the best dry model for March 2017 to 2020.

Figure 14 .
Figure 14.Sample PM2.5 estimation maps using the best wet model for July 2017 to 2020.

Figure 15 .
Figure 15.Sample PM10 estimation maps using the best dry model for March 2017 to 2020.

Figure 16 .
Figure 16.Sample PM10 estimation maps using the best wet model for July 2017 to 2020.

Figure 17 .
Figure 17.Map legends in µg/m 3 for PM2.5 for March, PM2.5 for July, PM10 for March, and PM10 for July (left to right).Map gradient of Blue-Yellow-Red was used to show the low to high concentrated areas of PM as shown in Figure 17.White spaces are null values due to cloud cover.Even though it is difficult to determine clusters of PM from the maps alone, PM2.5 maps for March 2017 to 2020 show a high concentration of particulate matter in the Northwest -West region of the image with a lower concentration near the central area of NCR.

Table 2 .
Coefficient values of the generated PM models using multiple linear regression.
Table2shows the coefficient values of the resulting linear regression models for the yearly, dry, and wet seasons.

Table 3 .
Table3shows error measurements of the generated models, specifically Mean Absolute Error (MAE), Mean Squared Error (MSE), and RMSE.The lower the error values for all these model parameters implies a better performance of the models.Model error measurements of the generated PM models using MLR.

Table 4 .
Table4summarizes the train and test RMSE of the generated models using gradient boosting regression.Similar to the results from MLR, the resulting RMSE using gradient boosting regression was relatively low in comparison to the dependent variable's range.Moreover, computed RMSE between the train and test sets showed a minute difference between each other, indicating no case of overfitting for all the generated models.R 2 resulted in the highest training value of 2% but low test scores.Although gradient boosting in general results in better accuracy in modeling than MLR in most cases, machine learning algorithms require an ample amount of training data to properly produce models that were cross-validated with optimized hyperparameters, especially, when there are only two independent variables with 15 observation points.Model error measurements of the generated PM models using gradient boosting regression.