A Spatiotemporal Prediction Framework for Air Pollution Based on Deep RNN

Time series data in practical applications often contain missing values due to sensor malfunction, network failure, outliers, etc. To handle missing values in time series, and to address the lack of temporal modelling in many machine learning models, we propose a spatiotemporal prediction framework based on missing value processing algorithms and a deep recurrent neural network (DRNN). Using a missing tag and a missing interval to represent time series patterns, we implement three different missing value fixing algorithms, which are then incorporated into a deep neural network consisting of LSTM (Long Short-Term Memory) layers and fully connected layers. Real-world air quality and meteorological datasets (Jingjinji area, China) are used for model training and testing. Deep feed forward neural networks (DFNN) and gradient boosting decision trees (GBDT) are trained as baseline models against the proposed DRNN. The performance of the three missing value fixing algorithms and of the different machine learning models is evaluated and analysed. Experiments show that the proposed DRNN framework outperforms both DFNN and GBDT, validating the capacity of the proposed framework. Our results also provide useful insights into different strategies for handling missing values.


INTRODUCTION
Air pollution remains a serious concern in developing countries such as China and India and has attracted much attention. Typical sources of air pollution include industrial and traffic emissions, and the main pollutants are PM2.5, PM10, NO2, SO2, O3, etc. Among these pollutants, PM2.5 has attracted particular attention. PM2.5 is fine particulate matter, i.e. solid or liquid particles less than 2.5 micrometers in diameter. The correlation between health risk and the concentration of air pollutants has been studied (Stieb et al., 2008; Chen et al., 2013). Organizations and governments, such as the World Health Organization (WHO, 2006), the USA Environmental Protection Agency (Laden et al., 2000) and Japan (Wakamatsu et al., 2013), have implemented policies to support air pollution countermeasures.
Presently, the models for predicting air pollutants can be classified into two types. The first type includes mechanism models that track the generation, dispersion and transmission of pollutants, with predictions given by numerical simulation. Two commonly used mechanism models are CMAQ (Byun & Ching, 1999) and WRF/Chem (Grell et al., 2005). Both incorporate physical and chemical models: the physical models generate meteorological environment parameters, while the chemical models simulate pollutant transmission. The second type comprises statistical learning or machine learning models. These models attempt to find patterns directly from the input data rather than through numerical simulation. Widely used models include linear regression, Geographically Weighted Regression (Ma et al., 2014), Land Use Regression (Eeftens et al., 2012), Support Vector Machines (Osowski et al., 2007) and Artificial Neural Networks (Voukantsis et al., 2011; Feng et al., 2015). Various attempts have also been made to combine different methods in order to achieve better performance (Sanchez et al., 2013; Adams & Kanaroglou, 2016). A form of neural network known as the recurrent neural network (RNN) has exhibited strong performance in modelling temporal structures (Graves & Schmidhuber, 2009; Lipton et al., 2015). While open datasets grow more rapidly than ever, traditional machine learning methods may not be able to capture the complex patterns within such massive datasets. However, since the difficulty of training large neural networks has been alleviated (Hinton & Salakhutdinov, 2006; Hinton et al., 2006) and hardware developments grant researchers stronger computational resources, constructing deep neural networks (DNN) to learn complex patterns has become possible.
On the other hand, forecasting air pollution requires time series data, which may contain discontinuities due to sensor malfunction, network delay, etc. The gaps within the data must therefore be handled before training machine learning models. Various solutions have been proposed to alleviate the missing data problem, including smoothing, interpolation and kernel methods (Kreindler et al., 2006; White et al., 2011; Rehfeld et al., 2011). However, many of these methods require knowledge of the full dataset before fixing gaps, so the fixing phase and the model training phase have to be separated. This may reduce efficiency in a real-world application.
In this paper, our goal is to develop a spatiotemporal framework that is able to deal with missing values in time series data. To exploit informative missingness patterns, we design three real-time/semi real-time interpolation algorithms. We then introduce a spatiotemporal prediction framework incorporating deep recurrent neural networks (DRNN) and the interpolation algorithms. Numerical results demonstrate that our proposed DRNN outperforms strong baseline models including deep feed forward neural networks and GBDT. The main contributions of this paper are as follows:
(a) We introduce three missing value fixing algorithms by characterizing the missing patterns of time series data that are not missing completely at random.
(b) We propose a general spatiotemporal framework based on deep recurrent neural networks (DRNN) that takes advantage of both spatial and temporal correlations. The capacity of the framework is further enhanced by the fixing algorithms.

Spatiotemporal Forecasting
Both spatial and temporal information should be considered when forecasting the spatiotemporal distribution of air pollutants. Firstly, the air quality data at a given spatial point or within a certain area has internal temporal correlation. Historical states can affect current and future states, e.g. the air quality during the last hour will affect the air quality during the next hour. Secondly, air pollutants may disperse or be transported through the atmosphere, a process that is highly related to wind direction and wind speed, so the air quality of adjacent areas also influences the local air quality. In order to construct a precise forecasting model, both spatial and temporal correlations should be taken into account. The sources of air pollutants can be classified into two types, local emission and outside emission transported into the local area, whose properties can be depicted by temporal and spatial correlations, respectively. Spatial and temporal correlations are illustrated in figure 1, where blue circles represent adjacent points, green circles represent the target point, dashed lines are the temporal correlations between local air quality conditions, and red arrows are spatial correlations.

Fixing Missing Values
One way of handling missing values in time series prediction is to omit the missing sections and use only the consecutive parts. However, this is only applicable when missing values do not occur randomly and frequently. Moreover, the missing pattern of time series data may itself contain information that could improve prediction performance. The other option is to fix the missing values by resampling or interpolation, but these methods may require knowledge of the whole dataset before dealing with missing data, and may result in a two-stage modelling process (Wells et al., 2013). Recent works have tried to model the missingness of various datasets explicitly (Wu et al., 2015), or to interpolate according to the time series information of missing data in health care datasets (Che et al., 2016). We implemented three missing value fixing methods based on similar ideas for air pollution time series data. These methods are real-time or semi real-time because the missing data can be fixed in an "online" or batched fashion. Let $st_n = \{X_1, X_2, \ldots, X_T\}$ be a time sequence with missing values.
For each observation $X_t$, the missingness of its dth dimension is described by a missing mask $m_t^d$ and a missing period $\delta_t^d$. We implemented and compared three different interpolation methods for fixing missing values in an air quality data sequence:
(a) Fix the missing values using the latest valid observation (forward-fix). Here $t'$ ($t' < t$) denotes the most recent timestep at which the dth dimension was observed, and $x_{t'}^d$ denotes the latest observed value of the dth dimension. This method can fix missing values before it has seen the whole dataset, so it is a real-time algorithm.
(b) Fix the missing values using the mean value at the same time point over the whole month (mean-fix). Here $\tilde{x}^d$ denotes the average of all valid observations of the dth dimension at the same time point of each day in the same month. The method produces substitutes for missing values only after reading the data for a whole month, so it is semi real-time.
(c) Fix the missing values using a weighted sum of (a) and (b) (decay-fix). The rationale is that a given observed variable may have a long-term default value but can also be affected by sudden changes. By assigning an exponential decay weight, we can combine the latest observation with the long-term average. In this combination, the effect of the latest valid observation decreases as the missing period extends. Since it combines the two methods above, it is a semi real-time method.
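The three fixing strategies can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: missing values are marked with None, the function names are hypothetical, the mean is taken per time-of-day slot over whatever series is passed in, and the decay weight is assumed to be gamma = exp(-rate * delta), where delta is the number of timesteps since the last valid value and `rate` is an illustrative parameter not given in the text.

```python
import math
from typing import List, Optional, Sequence

Series = Sequence[Optional[float]]  # None marks a missing value


def forward_fix(series: Series) -> List[Optional[float]]:
    """(a) Replace each missing value with the latest valid observation."""
    fixed: List[Optional[float]] = []
    last: Optional[float] = None
    for x in series:
        if x is not None:
            last = x
        fixed.append(last)  # stays None until the first valid value appears
    return fixed


def slot_means(series: Series, period: int = 24) -> List[float]:
    """Mean of the valid observations at each time-of-day slot."""
    sums = [0.0] * period
    counts = [0] * period
    for t, x in enumerate(series):
        if x is not None:
            sums[t % period] += x
            counts[t % period] += 1
    return [s / c if c else math.nan for s, c in zip(sums, counts)]


def mean_fix(series: Series, means: Sequence[float]) -> List[float]:
    """(b) Replace each missing value with the mean of its time-of-day slot."""
    period = len(means)
    return [means[t % period] if x is None else x for t, x in enumerate(series)]


def decay_fix(series: Series, means: Sequence[float], rate: float = 0.5) -> List[float]:
    """(c) Blend (a) and (b) with the assumed weight gamma = exp(-rate * delta)."""
    period = len(means)
    fixed: List[float] = []
    last: Optional[float] = None
    delta = 0  # timesteps since the last valid value
    for t, x in enumerate(series):
        if x is not None:
            last, delta = x, 0
            fixed.append(x)
            continue
        delta += 1
        if last is None:  # nothing observed yet: fall back to the slot mean
            fixed.append(means[t % period])
        else:
            gamma = math.exp(-rate * delta)
            fixed.append(gamma * last + (1.0 - gamma) * means[t % period])
    return fixed


# Toy series with a "day" of 4 timesteps instead of 24.
pm25 = [10.0, None, None, 40.0, 20.0, 20.0, 30.0, None]
means = slot_means(pm25, period=4)
print(forward_fix(pm25))
print(mean_fix(pm25, means))
print(decay_fix(pm25, means))
```

In this sketch the decay-fix values move from the last observation toward the slot mean as the gap grows, which mirrors the behaviour described for method (c). Forward-fix needs only past values, whereas the two mean-based variants need the slot means to be available first.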

RNN and LSTM
A recurrent neural network (RNN) is a variant of the feedforward neural network (FNN). An FNN consists of layers stacked on top of each other, where each layer is composed of neurons and all connections between layers follow the same direction. An RNN introduces a cyclic structure into the network, implemented by self-connections of neurons. Through self-connected neurons, historical inputs can be 'memorized' by the RNN and therefore influence the network output. This 'memory' enables RNNs to outperform FNNs in many real-world applications. The inference process of an RNN is similar to that of an FNN and is carried out by forward propagation. An FNN is trained by the back propagation (BP) algorithm. Because an RNN models sequence data and takes the transfer of 'memory' into account, its training must accumulate BP results over the time dimension, resulting in the back propagation through time (BPTT) algorithm. Consider a basic RNN composed of one input layer of I neurons, one hidden layer of H neurons and one output layer of K neurons, whose input is a sequence X of length T. In the forward propagation, $x_i^t$ is the value of the ith input dimension at timestep t, and $w_{ij}$ denotes the weight between neurons i and j. The input and activation of neuron j at timestep t are denoted by $a_j^t$ and $b_j^t$, and $\theta_h$ represents the activation function of neuron h. In the BPTT algorithm, L is the loss function and $\delta_j^t$ is the gradient of the loss function with respect to the input of neuron j at timestep t. After the gradients are calculated, the weights of the network are updated by gradient descent. One drawback of RNNs is that, as the number of timesteps grows, the gradient may tend to 0, leaving the parameters of a network with long-term dependencies hard to train (Bengio et al., 1994; Hochreiter et al., 2001). This problem is called the 'vanishing gradient'.
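The forward and backward passes described above can be written out explicitly. The following is a reconstruction in the notation of Graves (2012), which the symbol definitions in the text appear to follow. It is a sketch of the standard equations for a single hidden layer, not necessarily the authors' exact formulation. Here h and h' index hidden neurons, k indexes output neurons, and $\delta_{h'}^{T+1}$ is taken to be zero.

```latex
% Forward propagation
a_h^t = \sum_{i=1}^{I} w_{ih}\, x_i^t + \sum_{h'=1}^{H} w_{h'h}\, b_{h'}^{t-1},
\qquad
b_h^t = \theta_h\!\left(a_h^t\right),
\qquad
a_k^t = \sum_{h=1}^{H} w_{hk}\, b_h^t

% Back propagation through time, with \delta_j^t = \partial L / \partial a_j^t
\delta_h^t = \theta_h'\!\left(a_h^t\right)
  \left( \sum_{k=1}^{K} \delta_k^t\, w_{hk} + \sum_{h'=1}^{H} \delta_{h'}^{t+1}\, w_{hh'} \right),
\qquad
\frac{\partial L}{\partial w_{ij}} = \sum_{t=1}^{T} \delta_j^t\, b_i^t
```

The recursion for $\delta_h^t$ multiplies by $\theta_h'$ and the recurrent weights once per timestep, which is where the gradient can shrink toward 0 over long sequences.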
To address the vanishing gradient problem, the Long Short-Term Memory (LSTM) structure was introduced (Hochreiter & Schmidhuber, 1997). An LSTM has the same basic structure as an RNN, but the neurons are replaced by memory blocks. Each memory block contains one or more memory cells and three nonlinear units (gates): an input gate, an output gate and a forget gate. Through multiplicative operations, the input gate, output gate and forget gate control the input, output and state reset of the memory cell, respectively. Two kinds of information flow exist within an LSTM. The first is from each memory block to other blocks or neurons, e.g. the output value of the memory cell. The second is within the same memory block, e.g. the cell state or 'memory' of the memory cell, the input of the memory cell, and the activation of each gate unit. The gates help prevent the gradient information of the LSTM from vanishing during back propagation, thus enabling the LSTM to learn dependencies across long time periods. The parameters of an LSTM are trained using BPTT.
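The role of the three gates can be illustrated with a single-cell LSTM step in plain Python. This is a simplified sketch, not the paper's network: it uses scalar inputs, omits the peephole connections that appear in the Graves (2012) memory block, and all parameter names and values are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple

Params = Dict[str, Tuple[float, float, float]]  # name -> (input weight, recurrent weight, bias)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x: float, h_prev: float, c_prev: float, p: Params) -> Tuple[float, float]:
    """One timestep of a single-cell LSTM without peephole connections."""
    def pre(name: str) -> float:
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input_gate"))    # how much new input is written
    f = sigmoid(pre("forget_gate"))   # how much of the old state is kept
    o = sigmoid(pre("output_gate"))   # how much of the state is exposed
    g = math.tanh(pre("cell_input"))  # candidate value to write
    c = f * c_prev + i * g            # updated cell state ('memory')
    h = o * math.tanh(c)              # output of the memory block
    return h, c


def run(xs: Sequence[float], p: Params, c0: float = 0.0) -> Tuple[float, float]:
    """Feed a sequence through the cell and return the final (h, c)."""
    h, c = 0.0, c0
    for x in xs:
        h, c = lstm_step(x, h, c, p)
    return h, c


# Forget gate held open and input gate held shut: the state is carried along.
keep = {"input_gate": (0.0, 0.0, -10.0), "forget_gate": (0.0, 0.0, 10.0),
        "output_gate": (0.0, 0.0, 0.0), "cell_input": (1.0, 0.0, 0.0)}
# Forget gate held shut: the state is reset at every step.
reset = dict(keep, forget_gate=(0.0, 0.0, -10.0))

print(run([0.5] * 50, keep, c0=1.0))
print(run([0.5] * 50, reset, c0=1.0))
```

With the forget gate near 1 the cell state is added to rather than repeatedly squashed, which is the mechanism that lets an LSTM carry information over many timesteps.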
The core structure of the LSTM memory block is illustrated in figure 3 (Graves, 2012).

Methods and Implementation Details
The index of agreement (IA) is a dimensionless index proposed by Willmott to assess the degree of model prediction error (Willmott, 1981).
The proposed models and baseline models are implemented using Python, Theano, Keras and Scikit-learn (Al-Rfou et al., 2016; Chollet, 2015; Pedregosa, 2011), and executed on a computer with an Intel Core i5-4590 CPU at 3.30 GHz, 16 GB RAM and an NVIDIA GeForce GTX 750 Ti graphics card.
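The evaluation measures named in this section can be sketched as follows. The functions use the standard definitions of MSE and RMSE and the original form of Willmott's index of agreement, IA = 1 - sum((P - O)^2) / sum((|P - mean(O)| + |O - mean(O)|)^2). The function names are illustrative, and the exact variants used in the experiments are assumed rather than confirmed by the text.

```python
import math
from typing import Sequence


def mse(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Mean squared error between observations and predictions."""
    return sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs)


def rmse(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Root mean squared error, in the same unit as the observations."""
    return math.sqrt(mse(obs, pred))


def index_of_agreement(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Willmott's index of agreement: 1 is a perfect match, 0 is no agreement."""
    o_mean = sum(obs) / len(obs)
    num = sum((p - o) ** 2 for o, p in zip(obs, pred))
    den = sum((abs(p - o_mean) + abs(o - o_mean)) ** 2 for o, p in zip(obs, pred))
    return 1.0 - num / den  # undefined when every value equals the observed mean


observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]
print(rmse(observed, predicted), index_of_agreement(observed, predicted))
```

Unlike RMSE, the index of agreement is bounded and unit-free, which makes it easier to compare across pollutants with different concentration ranges.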

Quantitative Results and Discussion
Precision measurements of the models for 1-hour and 8-hour predictions are provided in table 6 and table 7, respectively (measurements for 2~7 hours are not presented due to the length limit of this paper). From the comparisons we can draw two basic conclusions:
(a) In short-time prediction (< 4 hours), models based on forward-fix have the best performance, and models based on mean-fix perform worse than those based on forward-fix or decay-fix.
(b) With the same input data and similar network structure, deep recurrent neural networks obtain better results than deep feed forward neural networks and GBDT.
A possible explanation for the first conclusion is that air pollution events in the Jingjinji area are mostly due to sudden changes in atmospheric conditions. In short-time predictions, forward-fix can therefore stay close to the original air quality fluctuation trend, while mean-fix may oversmooth the sudden events in the original data, resulting in a less precise model. For long-time predictions (≥ 4 hours), however, models based on forward-fix are not always the best choice. For some long-time prediction tasks, both DRNN1 and DRNN2 achieve their best results when using decay-fix, as illustrated in figure 5 and figure 6. When DRNN1 is used to predict PM2.5 concentrations 7~8 hours into the future, it achieves its best result with decay-fix. DRNN1 likewise has the best performance with decay-fix when predicting 4~5 hours into the future. These figures suggest that considering the long-term average pattern may also improve the performance of prediction models.
As for the second conclusion, we can obtain further insight by comparing different models based on the same fixing algorithm. Below are the performance measurements of models based on the decay-fix method:
By spatially interpolating the time series prediction results, we can obtain the spatiotemporal distribution of air pollutants in the study area. A heavy pollution event was reported on November 10th, 2014; we therefore use the proposed DRNN2 with each of the three fixing algorithms to generate hourly predictions of PM2.5 and compare their forecasting performance. The PM2.5 concentration at each station 1 hour in the future is predicted from the historical data of the past 48 hours, and inverse distance weighted interpolation is then used to generate the spatial distribution at each future time point. The predicted spatiotemporal PM2.5 distributions from 0:00 to 3:00 a.m. on that day are illustrated in figure 11. According to the historical data, records between 1:00 a.m. and 2:00 a.m. were missing on November 10th, 2014, and the heavy pollution event started on November 9th. Our results show that the region around Shijiazhuang (station id 1028A) should have been heavily polluted during the missing period, but the mean-fix based method predicts only light pollution, while the other two methods predict the heavy pollution successfully. This is consistent with our assumption that methods based on mean-fix tend to smooth out the trend between 1:00 a.m. and 2:00 a.m.
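The inverse distance weighted interpolation step mentioned above can be sketched as below. This is an illustration under assumptions: station coordinates are treated as planar (longitude and latitude would normally be projected first), the distance exponent `power=2.0` is a common default rather than a value reported in the paper, and the function name is hypothetical.

```python
from typing import Sequence, Tuple

Station = Tuple[float, float, float]  # (x, y, predicted PM2.5)


def idw(stations: Sequence[Station], x: float, y: float, power: float = 2.0) -> float:
    """Estimate the value at (x, y) as a distance-weighted mean of station values."""
    num = 0.0
    den = 0.0
    for sx, sy, value in stations:
        d2 = (sx - x) ** 2 + (sy - y) ** 2
        if d2 == 0.0:
            return value  # the query point coincides with a station
        w = d2 ** (-power / 2.0)  # equals 1 / distance**power
        num += w * value
        den += w
    return num / den


# Two hypothetical stations on a line, with predicted concentrations 100 and 200.
stations = [(0.0, 0.0, 100.0), (2.0, 0.0, 200.0)]
print(idw(stations, 1.0, 0.0))
print(idw(stations, 0.5, 0.0))
```

Evaluating `idw` on a regular grid of points for each forecast hour would give one spatial map per timestep, which is the kind of output the distributions in figure 11 represent.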
By plotting the spatial distribution of meteorological conditions, we can also find some other properties of air pollution in this area:
(a) The spatiotemporal distribution of humidity has a real-time correlation with PM2.5, which is consistent with the conditions required for smog formation.
(b) A negative correlation exists between wind speed and PM2.5 concentration, with a lag of 1~2 hours, suggesting that smog in the area is strongly affected by wind.
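A lagged relationship such as the one in (b) can be checked by shifting one series against the other and computing the Pearson correlation at each lag. The sketch below uses made-up numbers and hypothetical function names. It shows the general technique, not the analysis performed in the paper.

```python
import math
from typing import Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)  # undefined for a constant series


def lagged_correlation(driver: Sequence[float], target: Sequence[float], lag: int) -> float:
    """Correlation between driver[t] and target[t + lag], for lag >= 0."""
    if lag > 0:
        driver, target = driver[:-lag], target[lag:]
    return pearson(driver, target)


# Synthetic hourly data: PM2.5 responds to wind speed two hours later.
wind = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 2.0, 1.0, 3.0, 5.0]
pm25 = [50.0, 50.0] + [100.0 - 10.0 * w for w in wind[:-2]]

correlations = {lag: lagged_correlation(wind, pm25, lag) for lag in range(4)}
best_lag = min(correlations, key=correlations.get)  # most negative correlation
print(best_lag, correlations[best_lag])
```

On real station data the minimum would be less sharp than in this synthetic case, so the lag is better read as a range, as in the 1~2 hours reported above.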

CONCLUSIONS
In this paper, we proposed novel deep learning frameworks that can efficiently handle missing values in spatiotemporal forecasting tasks. The motivation is that real-world time series datasets are prone to discontinuities, and models can be enhanced if the gaps within the data are fixed properly. In light of this, we proposed three real-time/semi real-time fixing methods that impute the missing values in an 'online' or 'batch' way. We then introduced a deep recurrent neural network constructed with LSTM on top of the fixing methods. Numerical results on datasets of the Jingjinji area showed that, by taking advantage of the 'memory' property, neural networks with LSTM outperform baseline models such as deep feed forward neural networks and GBDT. Our DRNN framework can predict both sudden heavy pollution events and average patterns with relatively high precision. The performance of the three fixing methods revealed that forward-fix is generally the best choice among them, which is consistent with the fact that air pollution in the Jingjinji area is often caused by sudden changes in atmospheric conditions. However, decay-fix may achieve better results than the other two in long-time predictions (4~8 hours), showing that adding long-term average patterns may improve model accuracy.

Figure 1. Spatiotemporal correlations
For N different monitoring stations, the input dataset can be denoted as N time series, $ST = \{st_1, st_2, \ldots, st_N\}$, where

Figure 2. Data flow of spatiotemporal forecasting


Let $\delta_t^d$ be the missing period of the observation's dth dimension, i.e. the number of timesteps since this dimension last had a valid value. $\delta_t^d$ can be expressed in terms of the missing mask $m_t^d$ and the timestep t.

Figure 3. Structure of LSTM memory block

EXPERIMENTS
Data Description
The study area is the Jingjinji area of northern China, which suffers from severe air pollution events that occur frequently during heating seasons. The data come from APIs provided by the Ministry of Environmental Protection of the PRC and the China Meteorological Administration. Two kinds of original data are used: (a) air quality data, including air quality records from monitoring stations and station information, and (b) meteorological data at county level. There are 80 air quality monitoring stations and 25 corresponding counties for meteorological data in the Jingjinji area. Our dataset covers the period from September 2013 to January 2015, and we fill in the discontinuous parts of the original data by applying the fixing algorithms. Statistics of the data before and after fixing are shown in table 1.
In our proposed framework, spatial and temporal correlations are represented by neighbouring stations and the 'memory' of LSTM, respectively. The final input includes 5 kinds of information:
(a) local air quality properties, e.g. PM2.5, PM10, O3, SO2, NO2, CO,
(b) local meteorological properties, e.g. temperature, wind direction, wind speed, humidity,
(c) air quality of neighbouring stations,
(d) time properties, e.g. weekday, date, month and hour,
(e) spatial properties, e.g. longitude and latitude of stations.
The dimensions of inputs are as

Figure 7. RMSE of models based on forward-fix
Therefore the spatiotemporal forecasting task can be defined as follows. At a given timestep t, find a subset $ST_{sub}$ of ST for target station n. Then predict the value of n at timesteps (t+1, t+2, …, t+F) based on the historical records of

Table 1. Statistics of data before and after fixing missing values.

Table 2. Features of input data
On top of the three missing value fixing algorithms (forward-fix/mean-fix/decay-fix), we propose two deep neural networks based on LSTM (DRNN1 and DRNN2). GBDT and two deep feed forward neural networks (DFNN1 and DFNN2) are used as baseline models. DFNN shares its basic structure with DRNN, but all layers of DFNN are fully connected layers. The network structures of DFNN1, DFNN2, DRNN1 and DRNN2 are shown in figure 4. Structure details of GBDT and the neural networks are provided in table 3 and table 4, respectively.

Table 4. Structures of deep neural networks
Training details of the deep neural networks are as follows:

Table 5. Training details of the deep neural networks
Deep neural networks may suffer from overfitting. We therefore use dropout to regularize the neural networks; the basic idea is to randomly remove neurons and their connections from the network during training (Srivastava et al., 2014). In our model, the dropout rate is set to 0.1. The performance of each model is measured by RMSE, MSE and IA (index of agreement).