ASSESSMENT OF EXPOSURE TO AMBIENT PM2.5 BASED ON GAP-FILLED AEROSOL OPTICAL DEPTH AT URBAN SCALE

Air pollution is a critical issue for human health and has drawn growing attention worldwide. Assessing the exposure of urban residents to PM2.5 from remote sensing is challenging because of data gaps in aerosol optical depth (AOD) and low spatial resolution. This article adopts an approach with two gradient boosting decision tree (GBDT) models to fill the gaps in AOD and derive a continuous PM2.5 distribution at urban scale, and then assesses exposure to PM2.5 in Beijing. First, the Simplified High Resolution MODIS Aerosol Retrieval Algorithm (SARA) was employed to obtain daily AOD from September 2016 to February 2017 at 500 m resolution. Then we used the first GBDT to derive the gap-filled SARA AOD and the second GBDT to estimate the PM2.5 spatial distribution based on multi-source data. Furthermore, population weighted exposure (PWE) levels of PM2.5 and the proportion of the population exposed to different PM2.5 concentrations were estimated from the PM2.5 distribution and population density data. The results demonstrate that both GBDT models performed well, with cross validation (CV) R² of 0.86 and 0.85 for AOD gap-filling and PM2.5 concentration estimation respectively. The areas with high PM2.5 concentration are mainly distributed in the east and south of the city, but the areas with higher PM2.5 exposure are mainly distributed in the urban centre. 80% people in Beijing are affected by PM2.5 pollution in autumn and Overall, the approach this applied and the analysis results are very useful for epidemiological investigation and air


INTRODUCTION
With the rapid development of urbanization and the economy in recent decades, air pollution in urban areas has been getting worse. As more and more people gather in cities, more than 90 percent of people in the world are exposed to air pollution (Shi et al., 2019). In urban areas, fine particulate matter (with aerodynamic diameters less than 2.5 μm, PM2.5) is viewed as one of the most serious air pollutants because it is easily transported and easily inhaled by humans (Sun et al., 2016). Exposure to ambient PM2.5 has an adverse impact on human health, increasing the risk of cardiovascular problems, respiratory problems, stroke and many other diseases (Brauer et al., 2015; Dominici et al., 2006; Huang et al., 2018; Thurston et al., 2015). In addition, the premature death of 8.9 million people was related to PM2.5 in 2015 (Burnett et al., 2018). Therefore, it is of great significance to estimate PM2.5 concentration and investigate PM2.5 exposure, especially in urban areas.
Satellite-based PM2.5 concentration estimation is applied more and more widely. AOD data offered by remote sensing satellites have been shown to be closely correlated with PM2.5 concentration (Li et al., 2005; Lin et al., 2015). Hence, it is feasible to obtain spatially extensive PM2.5 distributions from AOD products. For example, Hu et al. (2017) derived the spatial PM2.5 distribution of the conterminous United States based on Moderate Resolution Imaging Spectroradiometer (MODIS) AOD products (MYD04_L2) at 10 km resolution. Wei et al. (2019) applied Multi-Angle Implementation of Atmospheric Correction (MAIAC) AOD data to obtain the PM2.5 distribution of China with high accuracy. However, there are numerous data gaps in satellite-based AOD products due to clouds (Yu et al., 2015). This is a major challenge for accurate PM2.5 estimation, because missing AOD reduces the number of PM2.5 samples and, in turn, introduces biases into the results.
Some studies used simple methods such as spatial interpolation and imputation to fill the gaps in AOD (Kloog et al., 2011; Liang et al., 2018). Although these methods are easy to apply, the results are poor when too much neighbouring data is missing. Thus, more sophisticated and robust methods such as machine learning may be considered for AOD gap-filling.
Due to the large energy consumption caused by urbanization, PM2.5 pollution in China is among the most serious in the world (Van Donkelaar et al., 2014). Through the environmental policy efforts of the Chinese government, the overall average PM2.5 concentration in China decreased from 67 μg/m³ in 2013 to 45 μg/m³ in 2017. However, compared with the "good" level of 35 μg/m³ in the national standard GB3095-2012 (China's National Ambient Air Quality Standard, CNAAQS), China still has a long way to go (China, 2012). In addition, exposure to PM2.5, i.e. population weighted exposure (PWE), which takes the distribution of population into account, depicts the negative effects of PM2.5 on human health better than PM2.5 concentration alone. Van Donkelaar et al. (2014) combined three satellite-derived PM2.5 sources to evaluate PM2.5 exposure worldwide. He and Huang (2018) used geographically and temporally weighted regression (GTWR) to derive the PM2.5 distribution and calculate the exposure to PM2.5. It was found that over 92% of people in China face the threat of PM2.5 pollution. PM2.5 exposure in Australia was also estimated, and only 17% of the population lived in areas with PM2.5 concentration higher than 8 μg/m³ (Knibbs et al., 2018). However, the information these studies provide is only applicable at large scales. Few studies have been conducted at urban scale because of the limitation of spatial resolution: the resolution of the PM2.5 data adopted in most studies is coarser than 1 km, whereas urban-scale studies require data with finer resolution.
Our work aims to use SARA AOD products with 500 m resolution to assess exposure to PM2.5, taking Beijing as an example. GBDT, an ensemble machine learning method, was applied to fill the gaps in SARA AOD and to estimate PM2.5 concentration. Then, the PWE to PM2.5 and the proportion of the population exposed to different PM2.5 concentrations were calculated based on population density.

Study Area
Beijing (as Figure 1 shows) is the capital of China, located in the North China Plain (NCP), with a total area of 16410.54 km² and a population of 21.54 million. Due to rapid economic development and urbanization, Beijing has become the area with the most serious air pollution burden in China.

SARA AOD
The Simplified Aerosol Retrieval Algorithm (SARA) is a stable AOD retrieval algorithm with good reliability under complex atmospheric conditions (Bilal et al., 2013; Bilal et al., 2014). More importantly, the spatial resolution of SARA AOD is 500 m, which is much finer than that of other AOD products and better suited to the urban scale. SARA makes three assumptions: (1) the surface is Lambertian; (2) the single scattering approximation holds; (3) the single scattering albedo and asymmetry factor do not vary spatially over the region on the day of retrieval (Bilal et al., 2013). In addition, a traditional retrieval lookup table is not required, so this AOD retrieval method is convenient to implement. It obtains AOD directly from ground AOD observations, topography and angle data, top of atmosphere (TOA) radiance data and surface reflectance data. Here, we used ground AOD data from the Aerosol Robotic Network (AERONET, https://aeronet.gsfc.nasa.gov/) and downloaded the TOA product MOD02HKM, the topography and angle product MOD03 and the surface reflectance product MOD09 from the NASA website (https://ladsweb.nascom.nasa.gov/). Like other AOD products, SARA AOD is strongly affected by clouds, which prevent the acquisition of correct data. We used MOD35, the MODIS cloud mask product, to remove the pixels covered by clouds. However, this step left many gaps in SARA AOD, so we used a GBDT to fill them and obtain full-coverage SARA AOD.

PM2.5 Concentration
The ground-level PM2.5 observation data used in this article were downloaded from the Ministry of Environmental Protection of China (MEPC). Hourly PM2.5 data from September 2016 to February 2017 (roughly the autumn and winter in Beijing) were collected at 35 ground monitoring sites in Beijing and averaged into daily PM2.5 concentrations. Among the 35 sites, 17 are located in the core urban area (Haidian, Dongcheng, Xicheng, Chaoyang, Shijingshan and Fengtai districts) and 18 are evenly distributed in suburban areas.
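The hourly-to-daily averaging step can be illustrated with a minimal sketch. The record layout (site identifier, timestamp string, value) and the rule of skipping missing hours are assumptions made for illustration. The source does not state how missing hourly readings were handled.

```python
from collections import defaultdict
from statistics import mean

def daily_means(hourly_records):
    """Average hourly PM2.5 readings into one value per (site, date).

    hourly_records: iterable of (site_id, 'YYYY-MM-DD HH:MM', value) tuples.
    A value of None marks a missing reading (hypothetical convention).
    """
    buckets = defaultdict(list)
    for site_id, timestamp, value in hourly_records:
        if value is None:          # skip missing hours rather than count them as 0
            continue
        date = timestamp[:10]      # 'YYYY-MM-DD' prefix of the timestamp
        buckets[(site_id, date)].append(value)
    return {key: mean(values) for key, values in buckets.items()}

# Illustrative values only, not observations from the study.
records = [
    ("site_A", "2016-09-01 00:00", 40.0),
    ("site_A", "2016-09-01 01:00", 60.0),
    ("site_A", "2016-09-01 02:00", None),
    ("site_A", "2016-09-02 00:00", 30.0),
]
daily = daily_means(records)
```

In practice one might also require a minimum number of valid hours before accepting a daily mean. That threshold would be a further assumption.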

Population Density
To assess exposure to PM2.5, population density raster data at 1 km resolution were obtained from the Data Center for Resources and Environmental Sciences (DCRES) (http://www.resdc.cn). The population density raster product was derived from land use data, night light remote sensing data and the distribution of residential density. It was reprojected and resampled to the defined 500 m grid mentioned in Section 2.4.
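As a rough illustration of moving a 1 km raster onto a 500 m grid, the sketch below repeats each coarse cell twice along both axes (nearest-neighbour resampling). It assumes the two grids are already aligned after reprojection and that nearest-neighbour is an acceptable choice. The source does not name the resampling method actually used.

```python
def upsample_nearest(grid, factor=2):
    """Repeat each cell `factor` times along both axes (nearest-neighbour)."""
    out = []
    for row in grid:
        fine_row = [value for value in row for _ in range(factor)]
        out.extend([list(fine_row) for _ in range(factor)])
    return out

# Illustrative density values in people per km^2, not data from the study.
density_1km = [[100.0, 200.0],
               [300.0, 400.0]]
density_500m = upsample_nearest(density_1km)
```

Because the values are densities (people per unit area), they are copied unchanged into the finer cells. If the raster stored head counts per cell instead, each value would need to be divided by `factor ** 2` so that the total population is preserved.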

Gradient Boosting Decision Tree (GBDT)
GBDT is one of the boosting methods in the machine learning family. It is an ensemble learning method consisting of a series of base learners. Compared with other ensemble learning methods, GBDT has a more complex set of parameters, which supports its outstanding performance on high-dimensional and massive data (Friedman, 2001; Reid et al., 2015). Thus, GBDT is a good option for recovering missing AOD data and estimating PM2.5 concentration. In terms of model structure, the base learner in GBDT is a decision tree, as in Random Forest (RF). However, unlike in RF, the base learners are not independent, because each one is fitted to the residuals of the previous ones. The predicted result is the weighted sum of all base decision trees, as Equation (1) shows.
$$F_m(x) = F_{m-1}(x) + \rho_m h_m(x) \qquad (1)$$
Here, $F_{m-1}(x)$ is the additive result of all previous $m-1$ trees, $h_m(x)$ is the decision tree trained based on the former learner, and $\rho_m$ is the weight of $h_m(x)$.
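The additive structure described above, in which each new tree is fitted to the residuals of the trees before it and then added with a weight, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the models used in the study: it uses a single feature, depth-one trees (stumps), squared-error loss, and one constant weight for every tree. The names `fit_stump` and `fit_gbdt` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the one-threshold split of a 1-D feature that minimises squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:     # needs at least two distinct x values
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_gbdt(xs, ys, n_trees=50, weight=0.1):
    """Additive model: F_m = F_(m-1) + weight * h_m, each h_m fitted to residuals."""
    base = sum(ys) / len(ys)                   # F_0: constant initial prediction
    trees = []
    predictions = [base] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)        # new learner targets current residuals
        trees.append(tree)
        predictions = [p + weight * tree(x) for p, x in zip(predictions, xs)]
    return lambda x: base + weight * sum(tree(x) for tree in trees)

# Illustrative data only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = fit_gbdt(xs, ys)
```

A production model would use multi-feature trees of greater depth and a tuned number of trees and learning rate, typically through an established library rather than hand-written code.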
We constructed two GBDT models, for AOD gap-filling and PM2.5 estimation respectively. In the model for AOD gap-filling (model 1), all auxiliary data mentioned in Section 2.4 and the day of year (DOY) were used as independent variables. The model for PM2.5 estimation (model 2) additionally used the full-coverage AOD obtained from model 1 as the major independent variable. The key parameters selected for both models are shown in Table 1. GBDT can evaluate the importance of all predictors through its variable importance measure (VIM). Because there were two GBDT models, the VIMs of the two models were calculated separately. VIM was evaluated by how frequently a variable is used in model construction; that is, the more often a variable is used, the greater its contribution to the model. A sample-based 10-fold CV was adopted to validate the performance of the two GBDT models. Three measures, R squared (R²), mean predictive error (MPE) and root mean square error (RMSE), were used for the evaluation. Both the construction of the GBDT models and the CV evaluation were done in a Python 3.7 environment.
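The sample-based 10-fold split and the three evaluation measures can be sketched as follows. Two points are assumptions, since the source gives no formulas: R² is computed as the coefficient of determination, and MPE is taken to be the mean absolute difference between observed and predicted values. The function names are hypothetical.

```python
import math
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def mpe(observed, predicted):
    """Mean prediction error, taken here as the mean absolute error (assumption)."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    """Root mean square error."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))

# Illustrative values only.
folds = k_fold_indices(25, k=10)
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
scores = (r_squared(observed, predicted), mpe(observed, predicted), rmse(observed, predicted))
```

In a full cross validation, each fold in turn serves as the test set while the model is trained on the other nine, and the measures are computed on the pooled held-out predictions.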

[Table 1. Key parameters selected for model 1 and model 2.]

Ambient PM2.5 Population Weighted Exposure Assessment
To investigate the impact of PM2.5 on urban residents, ambient PM2.5 population exposure was assessed based on the PM2.5 estimation results. The total PWE was calculated based on Equation (2) (Zhan et al., 2017).
$$PWE = \frac{\sum_{i} P_i \times C_i}{\sum_{i} P_i} \qquad (2)$$
Here, $P_i$ and $C_i$ are the population and the PM2.5 concentration in grid $i$ respectively. Furthermore, we calculated the PWE of each grid as (3) shows.
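A small worked example may help with the total PWE. The sketch assumes the commonly used definition, in which the concentration in each grid cell is weighted by the population of that cell and the weighted sum is divided by the total population. Cells with missing data are skipped, which is an illustrative choice rather than something stated in the source.

```python
def population_weighted_exposure(population, concentration):
    """Total PWE = sum(P_i * C_i) / sum(P_i) over grid cells with valid data."""
    pairs = [(p, c) for p, c in zip(population, concentration)
             if p is not None and c is not None]
    total_population = sum(p for p, _ in pairs)
    if total_population == 0:
        raise ValueError("total population is zero; PWE is undefined")
    return sum(p * c for p, c in pairs) / total_population

# Illustrative values only, not data from the study.
population = [1000.0, 3000.0, 0.0]      # people per grid cell
concentration = [50.0, 100.0, 140.0]    # PM2.5 in ug/m^3
pwe = population_weighted_exposure(population, concentration)
```

Here the result is 87.5 μg/m³, whereas the unweighted mean concentration is about 96.7 μg/m³. The unpopulated cell with the highest concentration contributes nothing to the PWE, which is why PWE and mean concentration can show different spatial patterns.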

Model Performance
The CV performance of the two GBDT models is shown in Table 2 and Table 3. Model 1 performs well, with CV R² of 0.86, MPE of 0.10 and RMSE of 0.12. Its performance is best in October, with CV R² of 0.91, and worst in November, with CV R² of 0.77. Model 2 also performs well, with CV R² of 0.85, MPE of 22.47 μg/m³ and RMSE of 30.94 μg/m³. CV R², which is calculated on held-out test sets, is an important indicator of overfitting: a higher value indicates weaker overfitting. Overall, the performance of both models is good, despite some tolerable overfitting with CV R² lower than 0.8 in a few months. For model 2, the best and worst performance are in January and November respectively. The differences in performance among months may be related to sample size. The good performance of the two models suggests that GBDT can capture the characteristics of both AOD and PM2.5 at urban scale. Zhang et al. applied RF to MODIS 3 km AOD gap-filling and PM2.5 estimation with the Sichuan Basin as the study area, obtaining CV R² of 0.95 and 0.86 respectively. Although the accuracy of the two GBDT models in our work is slightly lower, it is acceptable given the differences in study area and study period.
Table 4 shows the VIMs of the two models. Planetary boundary layer height (PBLH) plays a very significant role in both models. This is because AOD describes the aerosol integrated over the vertical column, and PBLH determines the vertical range over which aerosol is distributed. In addition, PBLH is negatively related to the ratio of PM2.5 to AOD, so it is also important for PM2.5 estimation models (Sun et al., 2018). DOY is another important factor, as also demonstrated in other studies (Guo et al., 2017; Zhao et al., 2019; Zheng et al., 2017). Both AOD and PM2.5 vary greatly over time, and sometimes there is a large difference between two consecutive days.
In addition, there is a large difference in performance between winter and autumn. On the one hand, in the Beijing winter (December, January and February), aerosol accumulates at low altitude, which can aggravate PM2.5 pollution. On the other hand, because of heating demand, more energy is consumed in winter, which also produces many pollutants. In addition, some scholars have found that other pollutants also help to improve model accuracy (Wang et al., 2020). It is worth trying other pollutants such as SO2 and NO2 as independent variables in our future research.
Figure 2 shows the differences between the average cloud-removed AOD and the average gap-filled AOD. Both share approximately the same value range, from 0.16 to 0.57, while the average gap-filled AOD is smoother. The reason is the random disturbance of clouds, which causes discontinuous changes of average AOD in space. Without the AOD gap-filling process, these gaps would not only make daily estimation unavailable but also introduce biases into the average estimation, which would inevitably affect the accuracy of human health investigations such as PWE level assessment. Spatially, AOD values are low in the western and northern mountains and high in the eastern and southern plains. Besides terrain, socioeconomic factors also play important roles in the AOD distribution. For example, most of the population and industry is concentrated in the southern and eastern areas, producing large emissions.

Spatial Distribution of PM2.5 Concentrations and Population Density Weighted Exposure
The spatial distribution of PM2.5 concentration was retrieved with the second GBDT model, using the full-coverage AOD as the major predictor. The average PM2.5 spatial distribution is shown in Figure 3. The spatial pattern of average PM2.5 shows a decreasing trend from southeast to northwest, similar to that of AOD, which is consistent with the strong relationship between AOD and PM2.5. Average PM2.5 in the autumn and winter of Beijing ranges from 47 μg/m³ to 144 μg/m³. According to CNAAQS, when PM2.5 concentration is higher than 75 μg/m³, the environmental quality can be defined as 'pollution'. Thus, the PM2.5 pollution in the autumn and winter of Beijing is extremely serious, with PM2.5 concentration above that level in most areas. Furthermore, we obtained the PWE by combining the population density data with the average PM2.5 distribution. We used the Natural Breaks (Jenks) method to divide it into 7 levels (a higher level means heavier exposure to PM2.5), as Figure 4 shows. Spatially, high PWE levels are mainly distributed in the central urban areas. This differs from the distribution pattern of PM2.5 concentration, in which the southeast of Beijing suffers heavier PM2.5 pollution. It suggests that the Beijing government should pay more attention to pollution transport from the southeast to the central urban areas. The total PWE for the whole period and for each of the six months is shown in Table 5. Among the six months, residents of Beijing were most affected by PM2.5 pollution in December, with a PWE of 124 μg/m³. Moreover, in no month does the average PWE reach the "good" air quality level in CNAAQS (less than 35 μg/m³). Figure 5 shows the proportion of the Beijing population exposed to different PM2.5 concentrations over the six months. On the whole, more than 80% of people in Beijing suffered from PM2.5 pollution (higher than 75 μg/m³).
In addition, 16% and 6% of people lived in areas with PM2.5 concentration exceeding 150 μg/m³ (the standard of "serious pollution" in CNAAQS) in December and January respectively. Thus, the series of analysis results above underlines the importance and urgency of air pollution reduction in Beijing.
ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume V-3-2020, 2020 XXIV ISPRS Congress (2020 edition)
Figure 4. PWE level spatial distribution in autumn and winter of Beijing.
Figure 5. Beijing population proportion exposed to PM2.5 concentration in different months.
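The population proportion exposed above a given concentration, as discussed for the 75 μg/m³ and 150 μg/m³ levels, can be computed along the following lines. This is a minimal sketch assuming per-cell population counts and per-cell PM2.5 concentrations on the same grid. The function name and the sample values are hypothetical.

```python
def proportion_exposed(population, concentration, threshold):
    """Fraction of total population living in cells whose PM2.5 exceeds `threshold`."""
    total = sum(population)
    if total == 0:
        raise ValueError("total population is zero")
    exposed = sum(p for p, c in zip(population, concentration) if c > threshold)
    return exposed / total

# Illustrative values only, not data from the study.
population = [500.0, 1500.0, 2000.0, 1000.0]   # people per grid cell
concentration = [60.0, 80.0, 120.0, 160.0]     # PM2.5 in ug/m^3
share_polluted = proportion_exposed(population, concentration, 75.0)
share_serious = proportion_exposed(population, concentration, 150.0)
```

With these sample values, 90% of the population lives above 75 μg/m³ and 20% above 150 μg/m³. Repeating the calculation with monthly mean concentrations would give month-by-month proportions of the kind summarised in Figure 5.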

CONCLUSION
In summary, this article applied two GBDT models to derive the gap-filled SARA AOD and the PM2.5 concentration distribution at 500 m resolution. Based on the PM2.5 estimation results, the PWE and the proportion of the population exposed to PM2.5 in Beijing were derived. The performance is good, with CV R² of 0.86 and 0.85, MPE of 0.10 and 22.47 μg/m³ and RMSE of 0.12 and 30.94 μg/m³ for AOD gap-filling (model 1) and PM2.5 concentration estimation (model 2) respectively. The exposure results show that high levels of PWE are mainly concentrated in the central urban areas. Overall, over 80% of people in Beijing live in areas with PM2.5 concentration higher than 75 μg/m³, which meets the standard of "pollution" in CNAAQS, and the pollution situation is serious in December and January. This work also has some limitations. First, the population density data do not vary over time, so dynamic changes in exposure to PM2.5 may not be captured precisely. Second, the number of PM2.5 observations in our study is relatively small, which causes slight overfitting in a few months. Thus, we will extend the study period and obtain more observations in the future. In addition, the accuracy of the PM2.5 estimation model may be improved by introducing other pollutants as independent variables. In general, the approach and the information this article provides are valuable for future air pollution abatement.