MAXIMUM ENTROPY MODELING USING CITIZEN SCIENCE: USE CASE ON JACOBIN CUCKOO AS AN INDICATOR OF INDIAN MONSOON

The growing citizen science movement, in which non-professionals take part in scientific tasks such as species observation, yields opportunistic data for modeling and planning purposes. Such citizen science based observations can be a sustainable option for answering many research questions. Here, citizen science data on the bird Clamator jacobinus are taken from the Global Biodiversity Information Facility (GBIF) to predict its habitat suitability through a maximum entropy approach. The distribution data are divided into two monthly sets, June to October and November to May, to critically analyse the probable climatic reasons for its migration and to understand the influence of climatic variables on its suitability during the Indian monsoon season and the Southern Africa rainy season. The paper also describes the role of different bioclimatic variables in determining the bird's suitability and predicts how the bird will respond to climate change scenarios for the year 2050. Maximum entropy modeling is performed on both sets of data, and the results support an Indian belief that this bird is a harbinger of the monsoon in India. The study concludes that precipitation of the warmest and wettest quarters and isothermality are the major factors determining the migration of Clamator jacobinus, whereas hot, dry and cold climates are not suitable for this bird. Studies of this kind using citizen science data can be used in biodiversity planning as well as in improving the agricultural economy, because the monsoon is considered an auspicious season for the functioning of biodiversity and for agricultural tasks.


INTRODUCTION
The idea that climatic variation strongly influences species distribution can be traced in the literature back to around the 5th century BCE (Woodward, 1987). In the twentieth century, this line of study led to a general suite of powerful maximum entropy models that predict the suitability of species occurrences from detailed environmental variables (Franklin, 2010; Richardson, Whittaker, 2010). Such a maximum entropy approach to estimating the environmental requirements of geographically distributed species from presence/absence data is variously known as climate envelope modeling, ecological niche modeling, species distribution modeling, niche-based modeling or habitat suitability modeling. Despite such widespread interest, niche-based models and species distribution models have always faced conflicting views on what they truly represent. One view is that a niche-based model provides an approximation to the species' fundamental niche (Soberón, Peterson, 2005). Others argue that a species distribution model provides a spatial depiction of the realized niche, because the observed spatial distribution patterns used to estimate species-environment relationships are constrained by non-climatic factors (e.g. Austin et al. 1990; Guisan, Zimmermann, 2000; Pearson, Dawson, 2003).
In recent decades, numerous studies have evaluated the accuracy of the maximum entropy method for species distribution modeling (SDM) in predicting species ranges from presence-only data (e.g. Elith et al. 2006; Hernandez et al. 2006; Hu, Jiang, 2011; Kadmon et al. 2003; Skidmore et al. 1996; Stockwell, Peterson, 2002; Wisz et al. 2008). These studies now serve as guidelines on the minimum sample size, the number of species occurrences required, the effect of selecting random samples from a voluminous data set and, where random subsampling is used, the range predictions a model produces per species at each sample size (Elith et al. 2006; Hernandez et al. 2006; Kadmon et al. 2003; Stockwell, Peterson, 2002; Wisz et al. 2008). Owing to its increasing popularity, the maximum entropy model (Maxent) has become widespread in species distribution modeling: the papers that introduced it have received more than 5,000 citations in the Web of Science Core Collection, and more than 60% of distribution modelers report using it (Ahmed et al. 2015). Species information from herbaria and museums, theoretically observed data and survey data can deliver a substantial amount of resource information (Chapman, Grafton, 2008) for modeling species habitat suitability. The main challenge in these datasets, however, is location uncertainty (error in geotagging), which may be caused by incorrect geotagging of places or by setting the GPS to an improper datum (Graham et al. 2004; Wieczorek et al. 2004). Before the popularization of citizen science and geospatial technology, species data were collected and recorded as textual descriptions in the form of names and places, which changed over time (Wieczorek et al. 2004). Digitization of these textual descriptions introduced substantial positional uncertainty (~ several kilometres) (Feeley, Silman, 2010).
Geographical coordinates play a major role in extracting the co-located environmental variables, so such erroneous occurrences yield inaccurate species-environment relationships (Feeley, Silman, 2010). Researchers have developed various techniques that estimate and document the positional uncertainty of species occurrence records so that the most uncertain records can be removed prior to suitability modeling (Guo et al. 2008; Wieczorek et al. 2004), but the resulting reduction in sample size lowers model accuracy (Hernandez et al. 2006). Given such uncertainty in the data collection process, citizen science, with trained or untrained participants and supported by information technology, is considered a robust and rigorous data collection effort that ensures the quality and quantity of distribution records.
Citizen science "is a process where concerned citizens, government agencies, industry, academia, community groups, and local institutions collaborate to monitor, track and respond to issues of common community concern" (Whitelaw et al. 2003) "where citizens and stakeholders are included in the management of resources" (Cooper et al. 2007; Howe, 2000). In this paper, the current and future habitat suitability of the bird Clamator jacobinus (known as the Jacobin cuckoo or pied cuckoo) is studied using the maximum entropy model. The migratory movements of this bird are considered good indicators of the monsoon in Southern Africa and India (Urfi, 2011). According to Indian belief, the arrival of the pied cuckoo marks the beginning of the monsoon; the purpose of this study is therefore to critically analyze the likely climatic reasons for its migration and to identify which environmental variables play an influential role in its suitability. The species is sighted in March and April in southern India, where it is known to be extant year-round, until the middle of May. When the monsoon reaches the Andaman Islands, the first birds are seen in northern India with the summer monsoon winds; afterwards the birds are sighted across the west, north and east. In the first week of June, the monsoon reaches Kerala and the birds are seen everywhere except the extreme north-west and west. The extant range of Clamator jacobinus, with its resident, breeding and non-breeding areas (downloaded from The IUCN Red List of Threatened Species), is shown in Figure 2 and is used later in this paper to validate the results of this study.

NEED FOR CITIZEN SCIENCE IN BIODIVERSITY
The automated collection of geographical information on species occurrence (Damoulas et al. 2010; Stoeckle, 2003; Turner et al. 2003) by human observers is considered a reliable source. In the biodiversity domain, many organizations have designed citizen science projects around their scientific needs to produce fine-resolution data with rigorous sampling, locally and globally, such as bioblitzes (Novacek, 2008), a shell polymorphism survey (Silvertown et al. 2011), water quality surveys (Conrad, Hilchey, 2011), and breeding bird surveys (Freeman et al. 2007). However, biodiversity data collection raises several informatics challenges, which have been tackled in recent years: (i) the appropriate planning required to manage voluminous data sets through a data-management infrastructure, which also helps in motivating volunteers, and (ii) data quality and the handling of observational biases, including variation in sighting among volunteers (Cooper et al. 2007), 'false absences' arising from inadequate sightings (McClintock et al. 2010) and often uneven data distributions (Boakes et al. 2010). The need to address such challenges prompted the rise of data-intensive science (Hey et al. 2009), which is being applied to large-scale citizen science based research (Kelling et al. 2009). After reviewing the accuracy of citizen observatory data, the observation data of the Global Biodiversity Information Facility (GBIF) repository are used in this paper to apply a data-intensive science approach to maximum entropy modeling. GBIF occurrence data are used because GBIF has a process for managing voluminous and continually growing data acquired from thousands of volunteers using informatics and social networking.
GBIF is also an authentic repository in which various organizations and institutes share their data while ensuring data quality and, in particular, data quantity, both of which are relevant to modeling and decision-making purposes.

Distribution Data
The distribution data obtained from the GBIF repository for Clamator jacobinus comprise a total of 36,365 records based on specimen data and human and machine observations, with a very high number of NA values and duplicate records. To clean the distribution records, NA values and repeated geographical records are eliminated, leaving a final set of 10,292 presence records, the majority of which are citizen science data (10,160 records are human observations). These citizen observations are contributed to the GBIF database by various organizations such as eBird, Kenya Bird Map, naturgucker, Safring, National Biodiversity Data Bank and the Southern African Bird Atlas Project 2 (SABAP2). Such data are continuously used for developing suitability models and for planning safeguarding actions (Coxen et al. 2017; Pacifici et al. 2017; Robinson et al. 2018; Sullivan et al. 2017). For distribution data, this study targeted GBIF datasets because GBIF is the most up-to-date catalogue of species distributions, holding occurrence datasets from resourceful and authentic citizen science databases, systematic surveying stations and ad hoc observations from experts. To the best of our knowledge, GBIF maintains a proper balance between the quality and quantity of citizen science data at a broad scale. After removing the null and duplicate records from the raw data, 10,292 independent presence records are used to construct the suitability model for Clamator jacobinus. For this study, the distribution data are categorized into two sets on the basis of the climatological seasons favourable to the target species across its range: a November to May set (mostly suitable for the bird's residence in Southern Africa) and a June to October set (Indian monsoon).
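The cleaning and seasonal split described above can be sketched as follows. This is only an illustration under stated assumptions: the record fields `lat`, `lon` and `month` are hypothetical names rather than the GBIF column names, and de-duplicating on exact coordinate pairs before the seasonal split is an assumed reading of "repeated geographical records".

```python
# Months of the Indian monsoon set as defined in the text (June to October).
MONSOON_MONTHS = {6, 7, 8, 9, 10}


def clean_records(records):
    """Drop records with missing values and repeated coordinate pairs.

    Each record is a dict with hypothetical keys 'lat', 'lon' and 'month'.
    """
    seen = set()
    cleaned = []
    for rec in records:
        lat, lon, month = rec.get("lat"), rec.get("lon"), rec.get("month")
        if lat is None or lon is None or month is None:
            continue  # NA value: discard the record
        key = (lat, lon)
        if key in seen:
            continue  # repeated geographical record: keep the first only
        seen.add(key)
        cleaned.append(rec)
    return cleaned


def split_by_season(records):
    """Return (June-October set, November-May set)."""
    monsoon = [r for r in records if r["month"] in MONSOON_MONTHS]
    rest = [r for r in records if r["month"] not in MONSOON_MONTHS]
    return monsoon, rest
```

A design note: if duplicates were instead removed separately within each seasonal set, the two sets could share coordinates; the source does not say which order was used.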

Current Climatic Data
To study the eco-physiological responses of Clamator jacobinus to rapid changes in the environment, nineteen bioclimatic variables for the period 1970-2000 are obtained from the WorldClim database at a spatial resolution of 2.5 arc minutes (~4.5 km) (Fick, Hijmans, 2017; Hijmans et al. 2005; Smeraldo et al. 2018). These bioclimatic variables are built from monthly data from 1 January 1970 to 31 December 2000 at very high resolution, and contain more meaningful information than simple precipitation values.

Future Climatic Data
To project habitat suitability from current to future conditions, bioclimatic layers for the year 2050 (averaged for 2041-2060) corresponding to the climatic responses of Representative Concentration Pathway (RCP) 8.5 are used. These datasets are obtained as the mean ensemble of various Coupled Model Intercomparison Project (CMIP5) models (Amman et al. 2003; Amman et al. 2007; Sato et al. 1993; Stenchiko et al. 1998) at a spatial resolution of 2.5 arc minutes (~4.5 km). RCP 8.5 is used to estimate changes in migration under a pessimistic scenario, in which atmospheric CO2 levels in 2100 are 2.5 times higher than current levels (Riahi et al. 2011; Sharma et al. 2017).

Model Construction and Validation
The focal objective of this study is to predict scenarios of future suitability from the current suitability of Clamator jacobinus. For model construction, the machine learning based Maxent approach is used to predict habitat suitability by establishing the spatial relationship between presence records and the most significant bioclimatic variables; it provides high extrapolation accuracy even without absence records or with few presence records (Phillips et al. 2004; Phillips et al. 2006). Before constructing the model, reasonable geographic boundaries must be estimated, because Maxent map outputs are scale dependent and the range of climates in the study area, including environmental conditions in which the species is consistently absent, determines which habitat is predicted as most suitable. Therefore, if the range is too wide, the conditions of favourable environments may be lost among the prevalent poor environments. As a necessary first step, a two-degree buffer is applied around the extremes of the species' extant range, and the bioclimatic variables are restricted to these geographical boundaries. The model starts from a uniform distribution for Clamator jacobinus and runs through multiple iterations with the most substantial environmental variables until predictions no longer improve.
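The two-degree buffer around the extremes of the species' range can be illustrated with a simple bounding-box calculation. This is a minimal sketch, assuming occurrences are given as (latitude, longitude) pairs in decimal degrees; the function name and the clamping to valid coordinate limits are illustrative choices, not details from the study.

```python
def buffered_extent(points, buffer_deg=2.0):
    """Return (west, south, east, north) of the occurrence extremes
    expanded by buffer_deg degrees on every side.

    points: iterable of (lat, lon) pairs in decimal degrees.
    """
    points = list(points)
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    # Expand each extreme, clamping to the valid latitude/longitude range.
    south = max(min(lats) - buffer_deg, -90.0)
    north = min(max(lats) + buffer_deg, 90.0)
    west = max(min(lons) - buffer_deg, -180.0)
    east = min(max(lons) + buffer_deg, 180.0)
    return west, south, east, north
```

The resulting box would then be used to crop the bioclimatic layers before model fitting, so that background conditions are drawn only from the buffered range.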
To assess the behaviour of the model, a k-fold cross-validation approach (Kohavi, 1995) is applied to the occurrence data, separating it into train and test data, each containing five random observation groups (because five folds are considered here). The train data hold 75% of the total records for model construction, and the remaining 25% form the test data used to validate the model results. The second step is the selection of parameters: (i) betamultiplier: the regularization multiplier, whose value ranges from 1 to 15, is set to '1' because smaller values fit the projected distribution more closely to the training data; (ii) Explanatory Variables (EVs): four variable transformation types, hinge, linear, quadratic and threshold, are used as features for continuous EVs (Phillips, Dudík, 2008); and (iii) the threshold features used are 'lq2lqptthreshold', 'l2lqthreshold' and 'hingethreshold'.
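The data partitioning described above can be sketched in two small helpers: a random 75%/25% train/test split and a five-fold partition. This is a hedged illustration only; the function names, the fixed seed and the use of Python's `random` module are assumptions, and the source does not specify how the two schemes were combined.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle records and return (train, test) with the given test share."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold(records, k=5, seed=0):
    """Yield (train, test) pairs, using each of k random groups as test once."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k roughly equal groups
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test
```

Fixing the seed makes a given replicate reproducible; varying it across runs would give the random resampling that replicate runs rely on.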
The statistical assessment of the model is based on the area under the receiver operating characteristic (ROC) curve (AUC), which gives the probability that a randomly chosen presence site is ranked above a randomly chosen absence site (Fielding, Bell, 1997). ROC analysis evaluates model performance through two factors used to test the predictions: sensitivity (absence of omission error) and specificity (absence of commission error). Computing the AUC involves setting thresholds on the prediction to generate the false positive rate (prediction of presence at a site where the species is absent) and then evaluating the true positive rate (successful presence prediction) as a function of the false positive rate. A random ranking of sites has an average AUC of 0.5 and a perfect ranking achieves the best possible AUC of 1.0; models with an AUC above 0.75 are considered potentially useful, with good discriminatory power (Elith 2002; Phillips, Dudík, 2008).
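The ranking interpretation of the AUC given above can be made concrete with a short calculation. The sketch below computes the AUC directly as the fraction of (presence, absence) pairs in which the presence site receives the higher score, counting ties as one half. With presence-only data, background sites are commonly substituted for absences; the score lists here are hypothetical inputs, not outputs of the study.

```python
def auc(presence_scores, background_scores):
    """AUC as the probability that a random presence site outscores a
    random absence (or background) site; ties count as one half."""
    wins = 0.0
    for p in presence_scores:
        for b in background_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(presence_scores) * len(background_scores))
```

For example, `auc([0.9, 0.4], [0.5, 0.1])` evaluates to 0.75, because three of the four presence/background pairs are ranked correctly. The pairwise loop is quadratic in the number of sites, which is acceptable for a sketch; a rank-based formulation would be preferable for large samples.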

RESULTS
Maxent is a machine learning process that goes through multiple iterations to convert a training model into an acceptable model. Because of its stochastic nature, multiple replicates of the model must be run so that the average of the outputs gives a suitable result. The model was therefore set to ten replicate runs for each of the two sample sets; in each set, 75% of random samples are used for training the model and the remaining 25% for assessing the model's performance. The first set consists of citizen observation records from June to October, the Indian summer monsoon season, which brings more than 90% of total annual precipitation in central and western parts of India and 50%-75% of total annual rainfall in southern and north-western parts of India. The second set holds distribution records from November to May (sighted by citizen scientists), covering four climatological seasons: the Indian post-monsoon, winter and summer seasons, and the rainy season of Southern Africa, where Clamator jacobinus is known to be resident.

Model Evaluation
After the model is run, its performance is assessed with the plots shown in Figures 3 and 5. These plots report the AUC for both training and test data: the red (training) line shows how well the model fits the training data, and the blue (testing) line shows its performance on the test data; the training AUC is expected to be higher than the test AUC. If the model performs poorly, the blue line falls below the turquoise line, whereas the closer the blue line lies to the top left of the graph, the better the model predicts the presence records in the test data. The closer the sensitivity value is to 1, the better the prediction. In both plots the model predicts better than a random one (AUC of 0.5), with an AUC of 0.909 for the first set of data and 0.834 for the second set. Figures 4 and 6 show the separate AUC for the two sample sets, 0.97 and 0.9, which are close to 1; the model's predictions can therefore be trusted.

Contribution and Importance of Environmental Variables
Of the 19 variables used in modeling (Table 1), the environmental conditions that most affect the spatial distribution of Clamator jacobinus during the Indian summer monsoon season (June to October) are isothermality (16.8%), mean temperature of warmest quarter (15.7%), annual precipitation, precipitation of warmest quarter (13.6%) and precipitation of wettest month (11.3%). Permutation importance is a method of computing feature importance by measuring the decrease in model performance when a feature is not available. The variables with the highest permutation importance are precipitation of warmest quarter and isothermality. For the other seasonal set (November to May), the environmental variables favourable for the suitability of Clamator jacobinus are precipitation of warmest quarter (31.9% of variation), isothermality (16.1% of variation), temperature seasonality (14.9% of variation) and precipitation of wettest quarter (8.9% of variation) (Table 2). This model gives precipitation of warmest quarter the highest permutation importance in computing the current site suitability of the species, which indicates that the preferred season for this bird is the monsoon season.
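The permutation importance idea can be sketched as follows: the values of one variable are shuffled across sites, the model is re-scored, and the resulting drop in AUC is recorded and normalized to percentages. This is a generic illustration rather than the Maxent implementation; the `score` callable, the tuple-based site representation and the seed are all assumptions made for the example.

```python
import random


def auc(presence_scores, background_scores):
    """Fraction of presence/background pairs ranked correctly (ties = 0.5)."""
    wins = 0.0
    for p in presence_scores:
        for b in background_scores:
            wins += 1.0 if p > b else 0.5 if p == b else 0.0
    return wins / (len(presence_scores) * len(background_scores))


def permutation_importance(score, presence, background, seed=0):
    """Percent importance per variable from the AUC drop after shuffling it.

    score: hypothetical callable mapping a tuple of variable values to a
    suitability score; presence/background: lists of such tuples.
    """
    rng = random.Random(seed)
    rows = list(presence) + list(background)
    n_presence = len(presence)
    baseline = auc([score(r) for r in presence], [score(r) for r in background])
    drops = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        rng.shuffle(column)  # break the link between variable j and the sites
        permuted = [tuple(row[:j]) + (value,) + tuple(row[j + 1:])
                    for row, value in zip(rows, column)]
        permuted_auc = auc([score(r) for r in permuted[:n_presence]],
                           [score(r) for r in permuted[n_presence:]])
        drops.append(max(baseline - permuted_auc, 0.0))
    total = sum(drops)
    # Normalize the drops to percentages; all zeros if nothing changed.
    return [100.0 * (d / total) if total > 0 else 0.0 for d in drops]
```

A variable that the model ignores produces no drop and therefore zero importance, while a variable the model depends on heavily produces a large drop.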

Current and Future Suitability of Clamator jacobinus
Based on the presence records for June to October, the model computed the suitable habitat of Clamator jacobinus (Figure 7), showing that the potentially suitable area for this bird is in India, owing to the monsoon season, while the remaining parts of its resident range are unsuitable or of low suitability. The colour scale shows the model output probability, ranging from 0 to 1, in suitability classes: high (light and dark green), medium (yellow and dark brown), low (light brown) and unsuitable (grey). In Figure 7, southern Indian states such as Andhra Pradesh, Goa, Karnataka, Kerala, Maharashtra and Tamil Nadu show high and medium site suitability, Gujarat in the west shows medium suitability, and northern parts of India show high and medium suitability. Southern Africa is found unsuitable for this bird because its June to October months have a dry and hot climate, which does not favour this bird's habitat. Areas shown in grey are unsuitable for the species, which agrees with the IUCN extant range layer shown in Figure 2. The suitability for November to May is shown in Figure 8, with a scale bar depicting good suitability for Clamator jacobinus in South Africa and southern parts of India. Figure 8 shows that Southern Africa has medium and low suitability, whereas South Africa has a good suitable range for this bird. Because the Indian monsoon season typically lasts from June to September in North and Central India, the model gives low or no suitability in these parts. South India, especially Tamil Nadu, receives most of its annual precipitation in October and November, and Figure 8 reflects this favourable season, which indicates that the model predicts good suitability for this species.
This study also predicted a future suitability map for 2050 (Figure 9) from current suitability conditions by projecting changes in the movement of the bird. It revealed that range contraction would occur in all parts of India except southern parts of Tamil Nadu, which would experience a slight improvement. In addition, the quantiles (at 5% and 95%) of the relevant environmental variables (at sites occupied by the bird) are calculated to give a comparative range between the current and 2050 climate scenarios for Clamator jacobinus with respect to the Indian monsoon season (Table 3).
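The 5% and 95% quantiles of a variable at occupied sites can be computed as in the sketch below. It uses linear interpolation between order statistics, which is one common convention; the source does not state which quantile definition was used, and the function names are illustrative.

```python
def quantile(values, q):
    """Quantile of values for q in [0, 1], by linear interpolation."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)          # fractional index into the sorted data
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def occupied_range(values):
    """Return the (5%, 95%) quantile range of a variable at occupied sites."""
    return quantile(values, 0.05), quantile(values, 0.95)
```

Applying `occupied_range` to the values of one bioclimatic variable extracted at presence sites, once for the current layers and once for the 2050 layers, would give a pair of ranges of the kind compared in Table 3.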

DISCUSSION
This is one of the first studies to explore the habitat suitability of Clamator jacobinus in Southern Africa and India, demonstrating the efficacy of wide-scale citizen science observation data in developing distribution models of habitat availability across the entire range of a species. Based on the presence records and environmental variables, species distribution modeling is carried out for the current distribution of Clamator jacobinus in India and Southern Africa. The model results are validated against the extant range layers of The IUCN Red List of Threatened Species.
The suitability results for the Indian monsoon season support the age-old belief that this bird is a sign of the monsoon in India. The study also shows that other climatic conditions, namely hot, dry and cold, are not favourable for the bird. Therefore, when the season in Africa changes to something other than the monsoon, the bird travels from Southern Africa across the Arabian Sea and arrives in North and Central India through South India, along with the monsoon winds, as the monsoon starts in India in June and lasts until September.

Table 3. Comparative climate change scenarios of current and 2050s

The maximum entropy approach gives a natural probabilistic interpretation, from the most to the least suitable habitat conditions, and its results can be easily interpreted by domain experts. In Maxent, the target distribution is estimated by finding the distribution of maximum entropy, that is, the one closest to uniform, subject to the constraint that the expected value of each feature under the estimated distribution matches its empirical average. Inexpensive citizen science data remain a vital source for biodiversity monitoring and continue to encourage public participation in the process of scientific discovery as well as in national well-being (Bonney et al. 2009; Ward et al. 2015).
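The maximum entropy principle summarized in this section can be written compactly. The notation below is an assumption introduced for illustration: X is the set of grid cells in the study area, x_1, ..., x_m are the presence sites, f_1, ..., f_n are the features derived from the environmental variables, and p is a candidate probability distribution over X.

```latex
\begin{aligned}
\hat{\pi} &= \arg\max_{p} \; H(p),
  \qquad H(p) = -\sum_{x \in X} p(x)\,\ln p(x), \\
\text{subject to}\quad
  & \sum_{x \in X} p(x)\, f_j(x) \;=\; \frac{1}{m}\sum_{i=1}^{m} f_j(x_i),
  \qquad j = 1,\dots,n, \\
  & \sum_{x \in X} p(x) = 1, \qquad p(x) \ge 0 .
\end{aligned}
```

Under these constraints the solution takes the exponential (Gibbs) form p(x) = exp(sum_j lambda_j f_j(x)) / Z, where Z normalizes the distribution and the weights lambda_j are fitted to the presence data. In practice the equality constraints are usually relaxed so that each feature expectation need only lie within a margin of its empirical average, and the regularization multiplier scales that margin.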

CONCLUSIONS
Correlating bioclimatic variables with species distribution for suitability analysis is a necessary step towards implementing conservation plans. Citizen science catalogues can provide a vast set of distribution records for achieving better species modeling results. The results support the age-old belief about the Jacobin cuckoo, that this bird arrives in most parts of central and northern India with the monsoon winds, but owing to climatic changes its suitability in 2050 may decrease compared with the current one. Citizen science data and maximum entropy modeling are valuable for predicting suitability with respect to ecological requirements, allowing several climatic variables to be assimilated into the model. The study presented in this paper could help decision makers in defining conservation plans and identifying the places Clamator jacobinus may inhabit in future, a crucial perspective for any strategy on biodiversity preservation.

REFERENCES
Elith, J., 2000. Quantitative methods for modeling species habitat: comparative performance and an application to Australian plants. In Quantitative methods for conservation biology. 39-58. Springer, New York, NY.