CLIMATIC EFFECTS ON ARGENTINA’S TRADITIONAL DRINK – A MACHINE LEARNING BASED YERBA MATE PRODUCTIVITY PREDICTION

: Yerba mate belongs to the most important agricultural products of Argentina. Due to climate change, together with El Niño and droughts, yields are negatively affected. Drought propagation from precipitation deficits to plant water stress, its impacts and especially its predictability are becoming an emerging field of research. This paper explores how the future yerba mate production in the states of Misiones and Corrientes, north-eastern Argentina, can be projected by using climate variables and satellite data. The applied methodology focuses on a machine learning approach based on multiple linear regression and random forest regression. The results indicate a significant relationship between yerba mate and the NDVI as well as the SOI. The highest yerba mate productivity is expected during weak La Niña events. The methodology of this analysis can successfully predict mate productivity based on satellite and climate data and can also easily be used for further research areas.


INTRODUCTION
Climate change and enforced effects of global teleconnections like El Niño are severely impacting traditional agriculture. In particular, the increasingly frequent occurrence of droughts has a major impact on agricultural productivity. In general, a drought describes a condition in which a precipitation deficit over a certain period of time causes moisture and water availability below average (Mishra & Singh 2010). The term is characterised by diverse definitions, which in turn are fundamental for the overall understanding of the context.
A meteorological drought occurs along with a precipitation deficit over a period of time. As it is based on the long-term mean relationship between evapotranspiration and precipitation, meteorological droughts can occur at any time of the year (Lloyd-Hughes & Saunders 2002). If the lack of precipitation persists for several weeks or months, soil moisture and infiltration decrease and consequently limit plant growth (Tallaksen & van Lanen 2004). In this case, an agricultural drought occurs (Mishra & Singh 2010). As soon as hydrological systems experience water shortages (e.g. unusually low groundwater recharge or river streamflow), the agricultural drought turns into a hydrological drought (Tallaksen & van Lanen 2004). If a drought has economic and social consequences, such as health, it is referred to as a socio-economic drought (Apurv et al. 2017). The whole process is also known as drought propagation (van Loon et al. 2012). When this paper discusses drought, we only refer to meteorological and agricultural drought.
As a quantitative assessment of a meteorological drought occurrence, its characteristics and propagation, drought indices like the Standard Precipitation Index (SPI) are a common tool (Salimi et al. 2021). In addition, analysing a drought _________________________________ * corresponding author through indicators is increasingly linked with research on drought impacts to deduce useful threshold values for drought prediction. As agricultural droughts are mostly expressed through crop production failures, forest fires or livestock sales, such variables are often correlated with drought indices (Stagge et al. 2015). In particular, the prediction of future crop production is one of the central components of agricultural research. As in many other areas of research (e.g. remote sensing), newer approaches use a diverse range of machine learning (ML) algorithms to predict future developments of drought indices (e.g. Byakatonda et al. 2019). Furthermore, the ML-based prediction of drought impacts on crop production is developed by several scholars, including Potop et al. (2012) or Kogan et al. (2019).
Climate change and global warming are accompanied by a multitude of global and regional impacts. Accelerated hydrological processes, caused by rising temperatures and an increased moisture demand of the atmosphere, and the absence of precipitation will contribute to a higher probability of severe future drought events (Li et al. 2009;Mukherjee et al. 2018). Predicting future climatic conditions is very difficult, as current climatic phenomena like El Niño are not completely understood yet.
The El Niño Southern Oscillation (ENSO) is a large-scale ocean-atmosphere system associated with ocean current and temperature changes (Morid et al. 2007). To assess the intensity of ENSO, the Southern Oscillation Index (SOI) is used, which measures the difference in air pressure between Tahiti and Darwin (Australia). Negative SOI values indicate an El Niño event, while positive SOI values are indicative of a La Niña event. These air pressure differences can lead to farreaching effects that affect large parts of the Earth, so-called teleconnections. For Argentina, an El Niño event is usually synonymous with high precipitation (Ulla 2016). However, the effects of ENSO differ greatly between regions, as do the impacts of climate change. For north-eastern Argentina, the study area of this paper, climate change models mostly predict reducing rainfall amounts in winter and spring. In summer and fall, however, increasing precipitation is more likely. This increase can probably be attributed to more frequent extreme events and is overall characterised through a higher interannual precipitation variability (Cabré et al. 2016;Marengo et al. 2010). Studies such as Thomasz et al. (2018) show different impacts of increasing drought occurrence in Argentina in response to climate change. Portela et al. (2015) examine historic drought events in the border zone of Brazil, Paraguay and Argentina and show drought phases between 1960 and the 1970s as well as since the mid-1980s. Historic droughts occurred in the research area in the years 1977/78, 1981, 1988, 1999/00, 2006/13 and 2020/21 (Vargas et al. 2011Global Drought Observatory 2022).
Yerba mate (Ilex paraguariensis) is a tree that originates from southern South America (between 21°S and 30°S and 48°30'W and 56°10'W). The plant needs regular precipitation, a minimum annual rainfall of 1200mm and high soil humidity. It is less sensitive to temperatures but grows best at an average temperature of 21°C (Croge et al. 2021;Heck & Mejia 2007). Its leaves were already consumed in pre-Columbian times by indigenous peoples (e.g. Guaraní) as an infusion, which is nowadays very popular in Argentina, Uruguay (mate), Brazil (chimarrão) and Paraguay (tereré). In addition, the plant is also used for e.g. energy drinks, cosmetics and textiles (Croge et al. 2021). In Argentina, it is mainly cultivated on plantations in the two states of Misiones and Corrientes since the early 20th century (Heck & Mejia 2007). In 2019, approximately 945,962 tonnes of yerba mate were produced worldwide, about 31.9% of them in Argentina (FAO 2021). The harvesting season in Misiones lasts roughly from April to September (Butiuk et al. 2016). Per capita consumption of yerba mate in Argentina is approximately 5 kg per year, only surpassed by Uruguay. Similar to some Asian beverages, mate tea has a strong social component, as it is usually shared (Bracesco et al. 2011). Furthermore, mate tea has numerous positive health properties. For instance, it might prevent some cancer types and is anti-inflammatory (Souza et al. 2015).
Most studies on drought impacts focus on staple crops such as grains. Less often, however, plants that have social rather than nutritional benefits are considered. For this reason, we are trying to find out with our research whether it is possible to make statistically significant statements about mate productivity in north-eastern Argentina on the basis of climate data and drought indices. Consequently, this paper tries to answer the following research question:

How can yerba mate productivity in Argentina be predicted reliably based on satellite and climate data?
To answer this question, we use multiple linear regression and random forest regression. We are not aware of any study that investigates mate productivity based on satellite data or drought indices. We published our data and script on GitHub (https://github.com/HannahKemper/YerbaMate.git).

Research area
The states (provincias) of Misiones and Corrientes are located in the extreme northeast of Argentina, enclosed by the neighbouring countries Paraguay, Brazil and Uruguay. Fig-ure 1 shows a map of the research area. In the national context, both Misiones and Corrientes are comparatively small and sparsely populated states (INDEC 2022). The landscape consists mainly of forests, savannas and grasslands (Velazco et al. 2018). Some important bodies of water include the Iberá wetlands, the Iguazú waterfalls and the Río Paraná, Argentina's longest river. In 2021, the latter was featured internationally in the press due to its low water level (BBC 2021).
The climates of Misiones and Corrientes are characterised by subtropical, humid conditions, which belong to the category Cfa according to Köppen and Geiger. While the average annual rainfall is around 1800mm, the mean annual temperature is at 21°C. In the summertime, the maximum daily air temperature can rise up to 39°C, whereas the coldest days in July can reach -6°C (Eibl et al. 2000). ENSO can influence weather patterns and agricultural yields in the region due to its ocean-atmospheric interactions. As ENSO events strongly correlate with precipitation, wet summer conditions in northeastern Argentina are associated with El Niño (warm events) and dry summer phases with La Niña (cold events) (Portela et al. 2015, Podestá et al. 2002Pol & Binyamin 2014). The latter has a higher influence in the research area, with a greater likelihood of remaining stable over a longer time (Rusticucci & Vargas 2002). Along with this, yields tend to be reduced during La Niña (Gutierrez 2017;Podestá et al. 2002).

Data sources
In order to create a suitable database for the analysis, meteorological and remote sensing-based datasets were used. The research period extended from 1990 to 2020, limited by the agricultural data availability. Information on the yerba mate production was obtained from the open data portal of the Argentinian Ministry of Agriculture, Livestock and Fisheries (Ministerio de Agricultura, Ganadería y Pesca 2022). Information on temperature was downloaded via Google Earth Engine, using a dataset from Rodell et al. (2004).
The Standardized Precipitation Index (SPI) provides information on the development of the precipitation deficit through the water cycle. In combination with data on agricultural production, it can be used to make statements about the occurrence of agricultural droughts and emerging vulnerabilities (Barker et al. 2016;Hayes et al. 2011). Negative values indicate dry conditions, with moderately dry conditions classified at values from -1.0 to -1.49, severely dry conditions at values from -1.5 to -1.99, and extremely dry conditions at values smaller than -2.0 (Morid et al. 2007). The SPI was calculated for annual values on the basis of 30-year precipitation sums from the CHIRPS dataset (Funk et al. 2015), using the following formula:  (2) The Normalized Difference Vegetation Index (NDVI) can be calculated due to the high absorption by vegetation in the red (R) spectrum and the simultaneous lack of absorption in the near-infrared (NIR) spectrum (Schmidt & Barron 2020;Pettorelli et al. 2011). It is the most commonly used vegetation index (Guo et al. 2015) and expresses plant health using the following formula: NDVI data was obtained using the Landsat 5 and 7 dataset in Google Earth Engine (Google Developers 2022).
All variables were then calculated for yearly mean values on the department (departamento) level using a Zonal Statistics tool in QGIS 3.12.

Exploratory Data Analysis
A first step in a statistical analysis is usually an exploratory data analysis. For this purpose, we calculated a correlation matrix, using Spearman's to check for correlations between the explanatory variables (cf. Figure 2). High coefficients are usually a good indicator for multicollinearity, which is a common problem in regression analyses. In our data, there was only one variable combination with a correlation coefficient > 0.7, which is a common threshold (Dormann et al. 2013). The relation between precipitation and SPI showed a correlation coefficient of 0.94, which is sensible as the SPI is based on precipitation values. Consequently, precipitation was not considered as a separate variable in the analysis.

Figure 2.
Correlation matrix. The Spearman correlation coefficient was used to calculate statistical similarity between the explanatory variables.

Regression analysis
The data cleaning and further analyses were conducted using Python 3.8. The data basis used in this study had to be cleansed by treating missing and outlying values, as these could affect the reliability and plausibility of relationships between variables. Missing values in yerba mate production were imputed using region-based mean values. Further outliers were detected and excluded using an interquartile range of 0.98 (Wong & Wang 2003). The values were normalised using the standard scaler in scikit-learn (Pedregosa et al. 2011). The independent and dependent variables were tested for skewness and split into training and testing data. For this, the training data size was set to 70%, a common threshold (Dobbin & Simon 2011).

OLS Model
One of the most common and simplest linear regression models is the Ordinary Least Squares model (OLS), which can be described by the following formula: where y is the dependent variable, " is a n × 1 vector of ones associated with the constant term parameter , X denotes an n × K matrix of explanatory variables associated with the K × 1 parameter vector " and is an error term (Halleck Vega und Elhorst 2015).
The explanatory variables were ordered according to theoretical relevance in order to add them in a stepwise approach. The defined order was SPI, SOI, NDVI, temp. The resulting specifications were then tested for their suitability using statistical measures, such as the Akaike information criterion (AIC), Bayesian information criterion (BIC) and the coefficient of determination (R²). Additionally, the variance inflation factor (VIF) was calculated to check for multicollinearity.

Random Forest Regression
A common ML algorithm applied for the prediction of numerical values is the random forest regression (RF). The RF is a so-called ensemble of regression trees where the predictions of several decision trees are averaged (Strobl et al. 2009). Using a randomised search cross validation, numerous models were calculated, varying parameters including the number of trees, the depth of trees, the minimum samples per split and the minimum samples per leaf. The result of the validation was the best-performing parameter combination for the data basis (Koehrsen 2018). As the model performance generally varies with sample size, we performed an additional k-fold cross validation (n_splits: 10) in order to evaluate the average performance of our models.
As a measure of model quality, several indicators were used. The Mean Squared Error (MSE) and Mean Absolute Error (MAE) are among common quality measures (Bachmair et al. 2016). Further, the percentage of correctly predicted values and the relative importance of the variables in the database were considered for the evaluation of the models. For the RF, we used the normalised variables.
The calculated RF model was then used to predict yerba mate productivity using different values for the explanatory variables. A value range with 0.1 steps was defined for all variables, stretching between the minimum and maximum values of the respective variable, with one standard deviation added to each. Predictions were made based on all possible combinations.

OLS Model
Different specifications for a simple OLS model were calculated. The results can be found in Table 1. Based on the AIC and BIC, model 3 seemed to be the most appropriate for the data basis. The specification that added temperature had a practically identical AIC, but the added variable was not statistically significant. For both of these models, both SOI and NDVI had positive correlation coefficients and were statistically significant (p ≤ 0.05). The SPI, on the other hand, was not statistically significant in any of the model specifications, which was also reflected by its strongly fluctuating coefficients.
The VIF was well below the commonly applied threshold of 10 for each variable, implying low multicollinearity. The condition number, a similar measure, also remained below the popular threshold of 30 (Lever et al. 2016

Random Forest Regression
The randomised search cross validation revealed that a model with 200 estimators was the most accurate. Details on the parameters and performance can be obtained from Table 2. In both models, the order of importance of the variables was the same, with the respective values differing slightly. The SOI was the most important variable within the models. The average accuracy, i.e. the number of correctly classified data points, of the RF models was 71.9. We

DISCUSSION
According to the results of the OLS model, there is a statistically significant relationship between yerba mate productivity and the SOI and NDVI. If the NDVI increased by 0.1, the productivity would also rise by 902 kg/ha. The relationship with the SOI was also positive. If the SOI increased by 0.1, the productivity would rise by 101 kg/ha. The interpretation of this value is not quite simple, which is mainly due to the nature of the linear model. It is doubtful that a strong La Niña event delivers higher yields than under normal conditions. However, one could interpret that a drier event is better for mate cultivation than too wet conditions, i.e. El Niño. There may be a connection to the possible irrigation of the plants, which was not controlled for in our model due to lack of data.
The RF performed well considering the difficulty of predicting numerical values ranging from 0 to 7,500 kg/ha. An accuracy value of over 70% was satisfying, given that the MAE values were smaller than 0.7.
Considering the predicted yerba mate productivity for the SPI, NDVI and SOI, a more complex image was given on the relationship between the variables. Predicting the expected production for certain SPI, NDVI and SOI values was helpful to understand the model's interpretation of the data. The great importance of the SOI combined with the fact that this variable can be predicted up to six months in advance could be relevant for future research.
The RF results indicated that the maximum yerba mate production is to be expected during weak La Niña events. El Niño events, which tend to be weak in north-eastern Argentina, on the other hand, were estimated to provide rather low yerba mate production. The relationship between vegetation vitality and productivity also appeared clear, with high NDVI values being related to high predicted production. For the SPI, the correlations were not quite so clear, as the values fluctuated strongly. However, negative SPI values, which represent drier phases, were most likely to be present in high forecasts. However, similar SPI values were also associated with very low predictions. The interpretation of this result was not entirely clear. It may even indicate a threshold value at which mate productivity tips.
The methodology presented here should only be understood as a first approach to the topic, as the analysis relied only on very basic statistical methods. Other, non-linear types of regression models or ML algorithms could provide more insightful statements in future research on this topic. For this, data with a higher spatial resolution or an explicit geo-reference with regard to the cultivation areas would probably be advantageous.
Several shortcomings of the methodology should be noted.
As it was already mentioned, we were not able to control for irrigation practises on yerba mate plantations due to missing data. Furthermore, the extension of our research area to other regions with yerba mate plantations could be sensible. Particularly, an extension to the south of Brazil, the largest producer of yerba mate (FAO 2021), could enhance the quality of the models. Since the employed variables, in particular the SOI, were not geographically explicit and not restricted to the plantations, a more pixel-based approach could be adequate for future studies. Then, information on soil types or topography could also be considered. In order to make a more accurate statement regarding droughts, one could also use a logistic regression which is based on a binary classification of the dependent variable. This was tested for our data, but it did not yield any significant results and was therefore discarded. However, it could be considered sensible for another case study.

CONCLUSION
According to the regression analysis performed in this paper, the relationship between the yerba mate productivity and the SOI and NDVI are both of high significance in north-eastern Argentina. Future monitoring of SOI values could be of importance for the prediction of the Argentinian yerba mate production, as the highest productivity values were predicted for weak La Niña events. Even though there were some limitations, the chosen methodology offers great potential for further use in other geographical regions, considering other crops and the integration of additional remote sensing information.