MAIZE YIELD ESTIMATION IN KENYA USING MODIS

Department of Geomatic Engineering and Geospatial Information Systems, Jomo Kenyatta University of Agriculture and Technology, Nairobi, Kenya; bkenduiywo@jkuat.ac.ke
Environmental Science and Policy, University of California, Davis, USA; (bkenduiywo,anighosh,rhijmans)@ucdavis.edu
Alliance of Bioversity International and CIAT, Africa Hub, Nairobi, Kenya; a.ghosh@cgiar.org
Regional Centre for Mapping of Resource for Development, Nairobi, Kenya; lndungu@rcmrd.org


INTRODUCTION
In Kenya, crop production is a vital contributor to food security and employment. The sector directly accounts for about 26% and indirectly for another 25% of gross domestic product (Machado and Paglietti, 2015; Kenya National Bureau of Statistics, 2017). Maize is the main staple food in Kenya, which has about 2.1 million ha of maize, more than 40% of the total cropland area. Maize yields are variable, as they are affected by droughts and pests. For example, fall armyworm infestations led to a 6.3% drop in maize production in 2017 (Kenya National Bureau of Statistics, 2017), leading to a severe maize shortage. A quantitative and spatially explicit understanding of variation in maize yield can support better investments, more efficient markets, and improved policy making. If yield estimates are timely, they can be used to avert food shortages through appropriate interventions such as imports.
Here we investigate the use of remote sensing (RS) vegetation metrics from 8-day Moderate Resolution Imaging Spectroradiometer (MODIS) products to estimate maize yields in Kenyan counties. We anticipate that remote sensing can provide cheap, early, and perhaps more accurate maize production estimates than those based on ground-based government surveys (Chivasa et al., 2017).
In the present study, we use Support Vector Regression (SVR) and Random Forest (RF) regression models to predict maize yields from MODIS vegetation indices in maize-producing counties in Kenya. The counties were grouped into homogeneous regions with similar maize phenology. Only maize pixels extracted from an existing cropland map were used, as described in Section 2.2. This cropland map was developed in 2015 from Landsat data through visual image interpretation by an analyst and guided on-screen digitization. Vegetation indices derived from maize-only pixels were aggregated to county boundaries and used to model maize yields based on reference county-level yields between 2010 and 2017. The reference yield data were obtained from the Kenya Ministry of Agriculture, Livestock, Fisheries and Irrigation (MOALFI).
The rest of the paper is organized as follows. Section 2 describes the data used, explains the approach we adopted, and illustrates how RS metrics from MODIS were used for maize yield prediction. Section 3 presents the results. It is followed by the Discussion and Conclusions.

Study area and data
Our study area encompasses the 37 Kenyan counties that grow maize. The counties are grouped into 8 regions with respect to similarity in the maize cropping calendar (Table 1, Figure 1). Trans Nzoia and Uasin Gishu counties are the major producers of maize in Kenya.
We considered only the long-rain season and defined the start and end of the growing season using the regional maize calendars in GEOGLAM (2020) (Table 1).
County-level maize yield data for 2010 to 2017 were obtained from the MOALFI, which has made them available via MOALF (2020) (Figure 2). The data were collected by Kenyan government field extension officials under the state department of agriculture and are continuously being made available online through the Global Open Data for Agriculture and Nutrition initiative. There are clear regional differences in maize yield, with the highest yield in the North Rift region, followed by South Rift, Nyanza, Western, Central, Coast, upper Eastern and lower Eastern.
Figure 3 gives an overview of the techniques used to implement this study. The basic tasks of our methodological framework include MODIS data acquisition, RS metrics computation, masking out non-maize areas using maize maps, exclusion of atmosphere- and sensor-affected pixels using MODIS quality masks, spatial and temporal aggregation of the metrics per county, maize yield prediction using SVR and RF, and validation of model predictions. Details of these steps are described in the subsequent subsections.
Figure 3. Methodological framework adopted for maize yield prediction.

MODIS data processing
To predict yield, we used the following MODIS-derived metrics: (1) Normalized Difference Vegetation Index (NDVI), (2) Green Normalized Difference Vegetation Index (GNDVI), (3) Leaf Area Index (LAI), (4) Gross Primary Productivity (GPP), (5) Normalized Difference Moisture Index (NDMI), and (6) Fraction of Photosynthetically Active Radiation (FPAR). NDVI, GNDVI, and NDMI were computed from MODIS 8-day 500 m surface reflectance data found in the MOD09 series products from the Terra satellite (NASA, 2020).
The NDVI is commonly used as a proxy for green biomass. It is computed as the normalized difference between the reflectance in the near infra-red (NIR) and red portions of the electromagnetic spectrum, that is,
$$\mathrm{NDVI} = \frac{\mathrm{NIR} - \mathrm{Red}}{\mathrm{NIR} + \mathrm{Red}}. \tag{1}$$
The GNDVI substitutes the red band in the NDVI equation with the green band:
$$\mathrm{GNDVI} = \frac{\mathrm{NIR} - \mathrm{Green}}{\mathrm{NIR} + \mathrm{Green}}. \tag{2}$$
The GNDVI was developed to estimate chlorophyll concentration in vegetation (Gitelson et al., 1996) and may be useful as a proxy for photosynthetic rate and plant stress. The NDMI is given as
$$\mathrm{NDMI} = \frac{\mathrm{NIR} - \mathrm{SWIR1}}{\mathrm{NIR} + \mathrm{SWIR1}}, \tag{3}$$
where SWIR1 is the short wave infra-red 1 band in MODIS surface reflectance. We used it to quantify water content in maize, since it is sensitive to moisture levels in vegetation. Soil moisture variability is one of the main factors affecting crop productivity.
Lastly, the GPP, LAI and FPAR metrics are 8-day 500 m MODIS products. GPP was acquired from the MOD17 series of products generated from the Terra satellite. It is based on the radiation-use efficiency concept and can potentially be used to quantify the generation of new biomass in vegetation. LAI and FPAR are found in the MOD15 product series. LAI is a dimensionless quantity, the one-sided green leaf area per unit ground surface area, that characterizes plant canopies. FPAR, in turn, is an important parameter in estimating biomass production because the development of vegetation is related to the rate at which radiant energy is absorbed by vegetation.
Compared to NDVI, the model-based biophysical variables GPP, LAI and FPAR normally show good correlation with crop yield and primary production (Coleman et al., 2017).
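As an illustration of the three normalized-difference indices described above, the following minimal Python sketch computes NDVI, GNDVI and NDMI for a single pixel. The function names and the reflectance values are hypothetical and chosen only for demonstration; the study itself computed the indices from MOD09 imagery.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or None when the denominator is zero."""
    denom = a + b
    if denom == 0:
        return None
    return (a - b) / denom


def vegetation_indices(nir, red, green, swir1):
    """Compute NDVI, GNDVI and NDMI from band reflectances of one pixel."""
    return {
        "NDVI": normalized_difference(nir, red),
        "GNDVI": normalized_difference(nir, green),
        "NDMI": normalized_difference(nir, swir1),
    }


# Hypothetical surface reflectance values (scaled 0-1) for one pixel.
pixel = vegetation_indices(nir=0.40, red=0.10, green=0.12, swir1=0.20)
print(pixel)
```

Returning None for a zero denominator is one possible convention; such pixels could equally be treated as masked.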
After computing the RS metrics, we masked out atmospheric effects, water and data affected by varying sensor conditions using the MODIS quality masks that come with the products. For instance, we masked out pixels with clouds, shadows, water, aerosol, cirrus, fire and snow from the MOD09 surface reflectance product. In the LAI and FPAR products, pixels with water, snow, aerosol, cirrus, and shadows were masked out. Similarly, pixels with clouds, dead detectors, or a poor confidence quality score were excluded. Finally, a second mask, the maize crop mask, was applied to the quality-masked image scenes in order to retain only maize growing areas within each county.
The masked images were used to compute spatial-temporal metrics for each county using the process summarized in Figure 4. First, the mean of all pixels within each county boundary was computed for each image scene to obtain spatial metrics. Then, the mean of all the spatial metrics within a defined maize season was computed to obtain spatial-temporal metrics. This procedure is available on Earth Engine: https://code.earthengine.google.com/60abb28e6af6e56296452591192e1e5e.
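MODIS quality layers encode conditions such as cloud or water as bit flags packed into an integer per pixel. The sketch below shows the general idea of decoding such flags and masking affected pixels. The bit positions in `HYPOTHETICAL_FLAGS` are assumptions made for illustration only and do not reproduce the actual MODIS quality layer layout, which should be taken from the product user guides.

```python
# Hypothetical QA bit layout, for illustration only
# (NOT the actual MODIS quality layer specification).
HYPOTHETICAL_FLAGS = {"cloud": 0, "shadow": 1, "water": 2}


def bit_is_set(qa_value, bit):
    """Return True if the given bit is set in the integer QA value."""
    return (qa_value >> bit) & 1 == 1


def is_clear(qa_value, flags=HYPOTHETICAL_FLAGS):
    """A pixel is kept only if none of the flagged conditions is set."""
    return not any(bit_is_set(qa_value, bit) for bit in flags.values())


def mask_pixels(values, qa_values):
    """Replace values of flagged pixels with None."""
    return [v if is_clear(q) else None for v, q in zip(values, qa_values)]


# Four hypothetical NDVI pixels with their QA values.
ndvi = [0.61, 0.15, 0.58, 0.40]
qa = [0b000, 0b001, 0b000, 0b100]  # clear, cloud, clear, water
masked = mask_pixels(ndvi, qa)
print(masked)
```

The maize crop mask can be applied in the same way, with a per-pixel maize/non-maize flag in place of the QA value.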
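The two-step aggregation described above (a spatial mean per county and scene, followed by a temporal mean over the season) can be sketched in plain Python as follows. The data layout, a list of pixel values per scene with a parallel list of county labels, and all names are assumptions made for this example; the authors' own implementation is the Earth Engine script linked above.

```python
from collections import defaultdict
from statistics import mean


def spatial_means(scene, county_of_pixel):
    """Mean of the valid (non-masked) pixel values per county for one scene."""
    by_county = defaultdict(list)
    for value, county in zip(scene, county_of_pixel):
        if value is not None:  # skip masked pixels
            by_county[county].append(value)
    return {county: mean(vals) for county, vals in by_county.items()}


def spatiotemporal_means(scenes, county_of_pixel):
    """Mean over all scenes in the season of the per-scene county means."""
    per_county = defaultdict(list)
    for scene in scenes:
        for county, value in spatial_means(scene, county_of_pixel).items():
            per_county[county].append(value)
    return {county: mean(vals) for county, vals in per_county.items()}


# Two hypothetical 8-day scenes over four pixels in counties "A" and "B".
county_of_pixel = ["A", "A", "B", "B"]
scenes = [
    [0.2, 0.4, 0.6, None],  # one pixel of county B is masked
    [0.4, 0.6, 0.5, 0.7],
]
season = spatiotemporal_means(scenes, county_of_pixel)
print(season)
```

Note that averaging per-scene means weights every scene equally, regardless of how many pixels survived masking in that scene.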

Feature selection
Feature selection is the process of selecting relevant variables that aid model prediction. It is an important step that helps minimize model over-fitting while improving prediction accuracy. We used RF's mean decrease in accuracy, a variable importance measure, to select relevant metrics from the initial 6 that were computed. In principle, the mean decrease in accuracy quantifies the loss in prediction accuracy when the information carried by a predictor is removed from the model, which RF does by randomly permuting that predictor's values. Figure 5 shows the outcome of RF feature importance. GPP was the most important metric in maize yield prediction, followed by NDVI, FPAR, LAI, NDMI and GNDVI. Following this guide, we selected all variables except LAI, taking information diversity into account.
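One common way to obtain a mean-decrease-in-accuracy style score is to shuffle one predictor at a time and record how much the prediction error grows. The sketch below illustrates this for any model exposed as a `predict` function. It is a simplified stand-in and not the RF implementation used in the study: RF evaluates the permutation on out-of-bag samples, whereas this example uses the supplied data directly, and the toy model and data are hypothetical.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y, n_repeats=20, seed=0):
    """Mean increase in MSE when each column is shuffled, one score per column."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(r) for r in rows])
    importances = []
    for col in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            # Rebuild each row with the shuffled value in position `col`.
            permuted = [r[:col] + [s] + r[col + 1:] for r, s in zip(rows, shuffled)]
            increases.append(mse(y, [predict(r) for r in permuted]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_predict(row):
    """Hypothetical model that depends on the first predictor only."""
    return 2.0 * row[0]


rows = [[float(i), float((i * 7) % 5)] for i in range(10)]
y = [toy_predict(r) for r in rows]
scores = permutation_importance(toy_predict, rows, y)
print(scores)
```

In this toy case the second predictor receives a score of zero because the model ignores it, while shuffling the first predictor increases the error.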

Maize yield prediction
We tested two machine learning methods, RF and SVR, for maize yield prediction using the RS metrics selected earlier. These models were adopted because previous studies have shown that they lead to good results compared to other methods (Kim and Lee, 2016; Kayad et al., 2019; Sakamoto, 2020).
2.3.1 Random Forest (RF)
The RF machine learning ensemble technique is based on CART (Classification and Regression Trees) (Breiman, 2001). A random forest fits many trees, each on a bootstrapped sample, and considers only a random subset of the variables at each split in a tree (James et al., 2013). We set the number of trees to 500 and the number of variables considered at each split to n/3, where n is the number of input variables.
2.3.2 Support Vector Regression (SVR)
Support vector machines have gained popularity in image classification and regression (Vapnik, 2000). SVR is a generalization of the classification problem in which the model returns a continuous-valued output as opposed to an output from a finite set. SVR makes predictions using an optimal hyperplane that minimizes prediction error. We used a radial basis kernel to construct the model's hyperplane. The model has two parameters, namely the kernel width parameter $\gamma$ and the penalty parameter $C$. We determined these parameters via a grid search, selecting the combination with the lowest mean square error.
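The two sources of randomness in a random forest, bootstrap sampling of observations and random selection of candidate variables at each split, can be sketched as follows. This is not a full RF implementation; the function names, the number of observations, and the rounding of n/3 down to an integer (with a minimum of 1) are assumptions made for illustration.

```python
import random

N_TREES = 500  # number of trees, as stated in the text


def bootstrap_indices(n_rows, rng):
    """Draw n_rows observation indices with replacement for one tree."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def candidate_features(n_features, rng):
    """Draw the variables considered at one split: n/3, rounded down, at least 1."""
    mtry = max(1, n_features // 3)
    return sorted(rng.sample(range(n_features), mtry))


rng = random.Random(42)
feature_names = ["NDVI", "GNDVI", "NDMI", "GPP", "FPAR"]
n_obs = 20  # hypothetical number of county-year observations

# One bootstrap sample per tree.
samples = [bootstrap_indices(n_obs, rng) for _ in range(N_TREES)]

# Candidate variables for a single split; a fresh draw is made at every split.
first_split = candidate_features(len(feature_names), rng)
print([feature_names[i] for i in first_split])
```

With five input variables, this rounding rule yields one candidate variable per split.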
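A grid search over the two parameters of a radial basis kernel SVR can be sketched with scikit-learn as shown below. The library choice, the synthetic data, the grid values and the plain 5-fold split are all assumptions made for this example; the paper does not state which software or grid the authors used.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in for county-level predictors and yields (not real data).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(60, 5))
y = 1.0 + 3.0 * X[:, 0] + 0.5 * np.sin(6.0 * X[:, 1]) + rng.normal(0.0, 0.1, size=60)

# Hypothetical grid over the kernel width (gamma) and the penalty parameter (C).
param_grid = {
    "gamma": [0.01, 0.1, 1.0],
    "C": [1.0, 10.0, 100.0],
}

# Select the combination with the lowest cross-validated mean squared error.
search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)

best = search.best_params_
print(best, -search.best_score_)
```

scikit-learn maximizes the score, which is why the mean squared error enters with a negative sign.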

Model evaluation
We used cross-validation to compute the Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE) and coefficient of determination ($R^2$) from paired observations of yield and the corresponding RS metrics. Given observed yields $y_i$ and their corresponding predicted yields $\hat{y}_i$, the RMSE and MAPE are computed as
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{4}$$
and
$$\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|, \tag{5}$$
where $n$ is the number of observations. The smaller the RMSE, the closer the predicted maize yields are to the actual ones. MAPE is the average of the absolute percentage errors of the model predictions, i.e., the average ratio of absolute yield errors to actual yields (Equation (5)). This measure expresses prediction error as a percentage, allowing for comparisons between studies. Lastly, $R^2$ is the proportion of variance in the dependent variable that is explained by the independent variables. We used 5-fold leave-one-year-out cross-validation to compute these model evaluation measures.

3. RESULTS
Figure 6 shows some of the metrics used to predict maize yields. All the metrics show an asymptotic relationship with maize yields. Maize yields increase linearly with NDVI from around 0.1 to 0.5, which corresponds to maize yields of 0-2 ton/ha. As NDVI rises from 0.5 to 0.7, yields increase sharply to between 2 and 5 ton/ha. For NDMI, a linear relationship is seen between -0.1 and 0.1, which, as for NDVI, corresponds to maize yields of 0-2 ton/ha. When NDMI is in the range 0.1-0.25, maize yields increase sharply to between 2 and 5 ton/ha. GPP exhibits a relationship with maize yields for values in the range 100-500. Lastly, FPAR shows a relationship with maize yields when it ranges between 10 and 60, though with some outliers.
Cross-validation results are shown in Figures 7 and 8. SVR had an RMSE of 0.50 ton/ha, a MAPE of 27.6% and an $R^2$ of 0.7, which was slightly better than the results obtained with RF.
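For reference, the evaluation measures reported above (RMSE, MAPE and the coefficient of determination) and a leave-one-year-out split can be sketched in Python as follows. The definition of the coefficient of determination as one minus the ratio of residual to total sum of squares is one common choice and an assumption here, as are the function names and the example values; the sketch holds out each distinct year in turn and does not reproduce the exact fold assignment used in the study.

```python
import math


def rmse(y, y_hat):
    """Root mean square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def mape(y, y_hat):
    """Mean absolute percentage error (requires non-zero observed values)."""
    return 100.0 * sum(abs(a - b) / abs(a) for a, b in zip(y, y_hat)) / len(y)


def r_squared(y, y_hat):
    """Coefficient of determination as 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def leave_one_year_out(years):
    """Yield (held-out year, training indices, test indices) for each year."""
    for held_out in sorted(set(years)):
        train = [i for i, yr in enumerate(years) if yr != held_out]
        test = [i for i, yr in enumerate(years) if yr == held_out]
        yield held_out, train, test


# Hypothetical observed and predicted yields in ton/ha.
observed = [1.0, 2.0, 4.0, 5.0]
predicted = [1.5, 2.0, 3.0, 5.0]
print(rmse(observed, predicted), mape(observed, predicted), r_squared(observed, predicted))

# Hypothetical year label for each observation.
years = [2010, 2010, 2011, 2012]
folds = list(leave_one_year_out(years))
print(folds)
```

Holding out whole years keeps observations from the same season out of both the training and the test set at the same time.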

DISCUSSION
We have adopted machine learning regression techniques to predict maize yields in Kenya using MOALFI data collected annually. The objective is to provide a remote sensing based platform for rapid yield estimation during the maize growing season. Maize is a staple food for most Kenyan families and is also a source of income. Owing to the lack of reliable maize production estimates, farmers have suffered from poor maize prices, and at other times shortages during low seasons have resulted in food scarcity. We therefore adopted RS metrics from the MODIS sensor for yield prediction. All the metrics are correlated with county yields recorded between 2010 and 2017 (Figure 6). The GPP metric during the maize growing season had the highest feature importance (Figure 5). This is expected because GPP acquired during the growing season has been established as one of the best indicators of the amount of new biomass in crops (Prince, 1991; Gitelson et al., 2006), which is why it correlates well with maize yields. In contrast to the findings of Shanahan et al. (2001), which demonstrated that GNDVI acquired during the mid-grain period was the most highly correlated with grain yield, GNDVI had the lowest importance in our study. This is because our study used the mean GNDVI over the entire season as opposed to the mid-grain period only.
The selected metrics (NDVI, GNDVI, NDMI, GPP, and FPAR) were used to predict maize yields using the SVR and RF machine learning methods. The performance of SVR and RF was very similar. Both methods explained a large amount of yield variability. The RMSE of 0.50 ton/ha (SVR) and 0.51 ton/ha (RF) is an improvement over other studies such as Guindin-Garcia (2010). The average prediction error attained by the two approaches, i.e. 27.6% for SVR and 29.3% for RF, may be sufficiently accurate for use, but it is also clear that there is much room for improvement.
Our study has demonstrated that it is possible to predict maize yields in Kenya using MOALFI historical data. Despite these encouraging findings, there is still room to improve yield predictions. For instance, we used a maize crop mask that was generated in 2015 via expert knowledge digitization. We expect that there may have been changes in maize growing area in different counties during the 2010-2017 period that we used for model prediction. Though we assumed such changes to be negligible, use of a maize crop mask generated annually to compute RS metrics might improve prediction accuracy. This is a subject of our future study. It is also important to note that administrative boundaries have changed over time under different Kenyan government regimes. These changes might have introduced biases when harmonizing collected maize yield data from old to new administrative boundaries. Nonetheless, despite aggregating RS metrics to county boundaries, the prediction accuracy attained is reasonable. However, although RS data are increasingly accessible at better spatial-temporal resolution and at no cost, ground reference data are still essential to design and validate RS metrics based predictions (Coleman et al., 2017).

CONCLUSION AND OUTLOOK
The study has demonstrated that maize yield estimation in Kenya can be achieved at reasonable prediction accuracy using the SVR and RF machine learning methods. Maize yield prediction can help MOALFI, traders and other food security stakeholders. In future work, we will consider regions with similar agro-ecological and cultural farming attributes and use annual maize masks generated by deep learning in our model predictions. We hope to design the models to predict yields at pixel level in each county.