A COMPARISON OF DECISION TREE-BASED MODELS FOR FOREST ABOVE-GROUND BIOMASS ESTIMATION USING A COMBINATION OF AIRBORNE LIDAR AND LANDSAT DATA

Forests are among the Earth’s most crucial resources. Forest above-ground biomass (AGB) mapping has long been a research focus because it provides valuable information for monitoring the carbon cycle, deforestation, and forest degradation. A methodology to rapidly and accurately estimate AGB is essential for forest monitoring purposes. Thus, the main objective of this paper was to investigate the performance of decision tree-based models in predicting AGB at a site in Huntington Wildlife Forest (HWF) in Essex County, NY, using continuous forest inventory (CFI) plots as reference data. The results of decision tree, random forest, and deep forest regression models were compared using light detection and ranging (LiDAR) data, Landsat 5 TM imagery, and a combination of the two. The results illustrated the importance of integrating Landsat 5 TM and LiDAR data, which benefits from both the vertical forest structure and the spectral information reflected by the canopy cover. In addition, the deep forest model, with a root mean square error (RMSE) of 51.63 Mg/ha and an R-squared (R²) of 0.45, outperformed the other tree-based regression models, regardless of the dataset.


INTRODUCTION
Forests are considered one of the Earth’s most valuable resources and need to be monitored in a timely manner (Bastin et al. 2017). Sustainable forest management is of paramount significance for many applications, including forest productivity assessment, carbon sequestration monitoring, and deforestation investigation. Importantly, forest above-ground biomass (AGB) plays a crucial role in carbon sequestration, which bears on global climate change (M. Li, Im, and Beier 2013). Accurate AGB estimation has therefore been an area of interest for many researchers. Conventional field measurement techniques provide accurate AGB estimates, but they are labor-intensive, costly, time-consuming, and impractical for large regions (M. Li, Im, and Beier 2013). In recent years, remote sensing data have paved the way for cost-effective AGB estimation over large areas.
Optical and synthetic aperture radar (SAR) imagery are valuable sources for forest monitoring applications. However, saturation is the most common issue with these datasets (Joshi et al. 2017; Kachamba et al. 2016). Saturation occurs in forests with multilayer canopies or dense biomass, where the spectral reflectance values of pixels are no longer sensitive to biomass changes, which degrades the quality of AGB estimation (Zhao et al. 2016). Zhao et al. (2016) reported that saturation is more severe for AGB values greater than 130 Mg/ha. In addition, weather conditions (e.g., rain, snow, shadow, and cloud cover) can greatly affect the quality of optical data. Light detection and ranging (LiDAR) is another remote sensing data source, which directly measures the vertical structure of the forest canopy (Boudreau et al. 2008). Although LiDAR can provide valuable information for AGB estimation, it is costly and therefore of limited use for large-scale applications.
So far, many studies have concentrated on leveraging remote sensing data for AGB estimation (Issa et al. 2020; Dube et al. 2016). These studies compared different remote sensing datasets and reported the achieved results. Several studies have focused on combining LiDAR, optical, and SAR data to maximize the potential of these datasets for AGB estimation (Shao, Zhang, and Wang 2017; Urbazaev et al. 2018). Zhang et al. (2019) and Cao et al. (2018) integrated LiDAR and optical imagery to improve the estimation of AGB. Shao and Zhang (2016) and Hyde et al. (2007) combined SAR data with optical and LiDAR data, which enhanced AGB prediction results. Machine learning techniques are commonly used in AGB estimation since they are well suited to the inherently non-linear characteristics of remote sensing data (C. . Among machine learning algorithms, decision tree-based models have shown better performance in AGB prediction (Y. . The random forest (RF) regression algorithm has been widely used in AGB estimation and has shown promising results (Mutanga, Adam, and Cho 2012; Dang et al. 2019; Karlson et al. 2015).
The existing literature on forest AGB estimation using remote sensing data leaves room to explore the potential of models and datasets more fully in order to improve estimation accuracy. The main objective of this paper is to assess the capability of combined LiDAR and optical data for accurate AGB estimation. To achieve this goal, three well-known decision tree-based machine learning algorithms are implemented using the integration of Landsat 5 Thematic Mapper (TM) imagery and airborne LiDAR data. Thus, this study presents a comprehensive comparison of decision tree (DT), RF, and deep forest regression models.
This study pursues the following research aims: 1) assessing the potential of integrating Landsat 5 TM and LiDAR data for AGB estimation, 2) comparing decision tree-based algorithms for predicting AGB values, and 3) evaluating whether the deep forest model can provide better results than DT and RF.

Study Area
This project was conducted in the Huntington Wildlife Forest (HWF), which is located in the central Adirondack Park in northern New York State (Figure 1). HWF, with an approximate area of 6,000 ha (latitude 44° 00′ N, longitude 74° 13′ W), was donated to the State University of New York College of Environmental Science and Forestry (SUNY-ESF) for research purposes. The elevation of the mountainous HWF property ranges from 473 m to 908 m above mean sea level. HWF has a mean annual temperature of 4.4 °C and a mean annual precipitation of 1010 mm (S. Li, Quackenbush, and Im 2019). The forest comprises 72% northern hardwoods, 18% mixed hardwood-conifer, and 10% conifer species.

Field Inventory Data
In this study, SUNY-ESF continuous forest inventory (CFI) plots were used as reference data. This dataset was collected during the summer of 2011 and contains 288 circular sample plots, each with an area of approximately 807 m². In each sample plot, all trees with a diameter at breast height (DBH) of 11.7 cm or greater were measured. For each tree, the species, DBH, and location relative to the center of the sample plot were recorded (S. Li, Quackenbush, and Im 2019). Then, AGB at the tree level was calculated using species-specific DBH allometric equations (Kennedy et al. 2018). Finally, plot-level AGB was calculated as the AGB per unit area within each sample plot (S. Li, Quackenbush, and Im 2019). In other words, the plot-level AGB in megagrams per hectare (Mg/ha) was calculated by dividing the summed tree-level AGB by the plot area.
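The plot-level calculation described above can be sketched as follows. This is a minimal illustration only: the function name, the assumption that tree-level AGB is expressed in kilograms, and the example tree values are hypothetical and are not taken from the CFI dataset. Only the approximate 807 m² plot area comes from the text.

```python
PLOT_AREA_M2 = 807.0  # approximate CFI plot area stated in the text


def plot_agb_mg_per_ha(tree_agb_kg, plot_area_m2=PLOT_AREA_M2):
    """Convert tree-level AGB values (kg, an assumed unit) to plot-level AGB in Mg/ha."""
    total_mg = sum(tree_agb_kg) / 1000.0   # kg -> Mg
    area_ha = plot_area_m2 / 10000.0       # m^2 -> ha
    return total_mg / area_ha


# Hypothetical tree-level AGB values for one plot (kg).
example_trees_kg = [250.0, 410.5, 120.0, 980.0]
agb = plot_agb_mg_per_ha(example_trees_kg)
```

Because each plot covers only about 0.08 ha, the summed tree biomass is scaled up by a factor of roughly 12.4 to express it per hectare.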

Figure 1. Location of the study area (Huntington Wildlife Forest) in Essex County, NY, for forest AGB estimation using decision tree-based models. Black circles indicate sample plots located in Huntington Wildlife Forest.

LiDAR Data
Discrete-return LiDAR data were acquired over HWF in May 2015 using a Leica airborne laser scanner (ALS70). First, a k-nearest neighbor imputation algorithm (k=5) was used to convert the raw point clouds into height-normalized point clouds. Then, predictors were computed from the height-normalized LiDAR data at 30 m grid cells. In total, 29 predictors were computed and fed as inputs into the machine learning models (Table 1). Since field measurements were collected in 2011, the working assumption was that HWF did not change between 2011 and 2015.

Landsat 5 TM Imagery
The Google Earth Engine (GEE) cloud platform was used to process and download the Landsat 5 imagery, and R software was then used to train the models and estimate AGB values. Spectral bands and several biomass-related vegetation indices were used to train the regression models. Table 2 lists the vegetation indices used in this study. Spectral bands were extracted from Landsat 5 TM imagery acquired over HWF in 2011. The Landsat 5 dataset used here contains three visible bands, one near-infrared (NIR) band, and two short-wave infrared (SWIR) bands with 30 m resolution. These images are atmospherically corrected, orthorectified surface reflectance products. A cloud mask was applied to remove the effect of clouds in the acquired images.
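To illustrate how a vegetation index is derived from the spectral bands, the sketch below computes the normalized difference vegetation index (NDVI) from red (TM band 3) and NIR (TM band 4) surface reflectance. NDVI is used here purely as an assumed example of a biomass-related index; the indices actually used are those listed in Table 2, and the reflectance values below are hypothetical.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index from NIR and red surface reflectance."""
    denominator = nir + red
    if denominator == 0:
        return 0.0  # illustrative convention for pixels with no signal
    return (nir - red) / denominator


# Hypothetical (NIR, red) reflectance pairs; TM band 4 = NIR, TM band 3 = red.
pixels = [(0.45, 0.05), (0.30, 0.10), (0.20, 0.20)]
ndvi_values = [ndvi(nir, red) for nir, red in pixels]
```

Dense, healthy canopy reflects strongly in the NIR and weakly in the red, so its NDVI approaches 1. This index is also one of the quantities that stops responding to additional biomass once the canopy closes.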

METHODS
In this paper, three decision tree-based machine learning regression models, namely DT, RF, and deep forest, were deployed and compared. Ensembles of decision trees, such as RF and deep forest, help to decrease the variance and increase the stability of predictions (Dey 2016). R and Python 3.7 packages were used to implement the regression models and predict AGB from LiDAR data, Landsat 5 TM imagery, and the integration of LiDAR and Landsat 5 TM data. Each model was run using a 70/30 training/testing split to calculate the root mean square error (RMSE) and R-squared (R²). The following subsections give a brief background and describe the parameters of each model. Parameters were tuned through a grid search approach.
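The evaluation protocol can be sketched as below, assuming a random shuffle for the 70/30 split and the coefficient-of-determination form of R². The paper does not state the splitting procedure, the random seed, or which R² definition was used, so these choices and all function names are assumptions made for illustration.

```python
import math
import random


def train_test_split_70_30(n_samples, seed=0):
    """Return shuffled index lists for a 70/30 training/testing split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(round(0.7 * n_samples))
    return indices[:n_train], indices[n_train:]


def rmse(observed, predicted):
    """Root mean square error, in the units of the target (Mg/ha for AGB)."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Split the indices of the 288 CFI plots into training and testing sets.
train_idx, test_idx = train_test_split_70_30(288)
```

Both metrics would be computed on the testing plots only, so that they reflect performance on data the model did not see during training.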

Decision Tree (DT)
DT is one of the most popular machine learning techniques and forms the foundation of tree-based models (Kotsiantis 2013). It develops a regression model based on a tree structure of conditional statements. DT uses attributes in the dataset to break the data down into smaller subsets by making decisions. Nodes can be divided into two categories: decision nodes and leaf nodes. The former specify the decisions used to split the data, while the latter hold the predicted values. DT is straightforward to interpret and can handle non-linear data. However, DT is prone to over-fitting, and a small amount of noise in the input dataset can markedly influence the predictions (Song and Ying 2015). The "rpart" package in R was used to implement the DT model. Figure 2 shows the parameters selected by the grid search for the combination of LiDAR and Landsat data. The complexity parameter (cp) is the minimum improvement in fit required for a split to be kept. The minsplit parameter defines the minimum number of observations that a node must contain for a split to be attempted. The minbucket parameter denotes the smallest number of observations allowed in a leaf node. The maxdepth parameter caps the depth to which the tree can grow.

ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume V-3-2021 XXIV ISPRS Congress (2021 edition)
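As a simplified illustration of how a regression tree chooses a split at a decision node, the sketch below searches one predictor for the threshold that minimizes the summed squared error of the two child nodes, while respecting a minimum leaf size analogous to minbucket. It is not the rpart implementation. The function names and the example data (canopy heights against AGB) are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y, minbucket=1):
    """Find the threshold on one predictor that minimizes the children's total SSE."""
    pairs = sorted(zip(x, y))
    best_threshold, best_sse = None, sse(y)
    # Each child must keep at least `minbucket` observations.
    for i in range(minbucket, len(pairs) - minbucket + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical predictor values
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best_sse:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            best_sse = total
    return best_threshold, best_sse


# Hypothetical data: a canopy height metric (m) and plot AGB (Mg/ha).
heights = [5.0, 8.0, 12.0, 20.0, 22.0, 25.0]
agb_values = [40.0, 50.0, 55.0, 180.0, 200.0, 210.0]
threshold, children_sse = best_split(heights, agb_values)
```

A full tree repeats this search over all predictors and recursively within each child node until a stopping rule such as minsplit, minbucket, maxdepth, or cp halts further growth.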

Random Forest
Random forest is a non-parametric ensemble method that combines many decision trees grown in parallel (Mahdianpari et al. 2017). It uses bagging, in which the training set for each tree is drawn randomly with replacement, together with random selection of predictors at each node. If there are M input predictors, then m ≤ M predictors are selected randomly out of M, and the best split on these m predictors is used to split the node. Each tree is grown to the largest possible extent without pruning (Ali et al. 2012). Bagging ensures diversity among the trees and thus reduces over-fitting. Moreover, RF can handle noisy datasets. The R package "randomForest" was used for RF model training. The selected parameters are listed in Figure 2. The ntree parameter is the number of trees in the forest. The mtry parameter defines the number of variables randomly sampled as candidates at each split. The nodesize parameter is the minimum number of samples in a leaf node.
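The two sources of randomness described above, and the averaging step that RF regression uses to combine its trees, can be sketched as follows. This is an illustration under assumed values, not the randomForest implementation: the function names, the sample count, M = 12, and mtry = 4 are hypothetical.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bagging sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def candidate_predictors(n_predictors, mtry, rng):
    """Pick mtry distinct predictor indices (m <= M) to evaluate at one split."""
    return rng.sample(range(n_predictors), mtry)


def forest_predict(tree_predictions):
    """Aggregate the per-tree predictions for one sample by averaging."""
    return sum(tree_predictions) / len(tree_predictions)


rng = random.Random(42)
rows = bootstrap_indices(10, rng)                              # training rows for one tree
cols = candidate_predictors(n_predictors=12, mtry=4, rng=rng)  # candidates at one node
out_of_bag = sorted(set(range(10)) - set(rows))                # rows this tree never saw
```

Rows that are left out of a bootstrap sample are commonly called out-of-bag samples and can serve as an internal check on each tree.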

Deep Forest
Deep forest is a novel decision tree ensemble approach that can be considered an alternative to deep neural networks (DNNs) with fewer hyper-parameters and lower complexity (Zhou and Feng 2017). Compared with DNNs, deep forest runs faster and is much easier to train. While DNNs require large-scale training data, deep forest can perform well with small-scale training data (Zhou and Feng 2017). This approach is also known as multi-grained cascade forest (gcForest). A cascade structure enables deep forest to perform representation learning, whereas in DNNs representation learning is done by layer-by-layer processing of features. In deep forest, each level of the cascade receives the feature information processed by its preceding level and passes its output to the next level (Zhou and Feng 2017). The number of cascade levels can be determined adaptively, which helps the model perform well even with small-scale data (Zhou and Feng 2017). It is worth mentioning that each level is an ensemble of random forests (i.e., an ensemble of ensembles). Each level contains different types of forests to increase the diversity required for ensemble construction. To implement the deep forest model, a combination of the gcForest and sklearn packages in R and Python was used. Figure 2 shows the parameters selected for the deep forest implementation. The n_cascadeRF parameter defines the number of random forests in a cascade layer, while n_cascadeRFtree specifies the number of trees in a single random forest of a cascade layer. The n_mgsRFtree parameter defines the number of trees in an RF during multi-grained scanning.
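The cascade idea, in which each level receives the original features together with the outputs of the preceding level, can be sketched as below. To keep the example self-contained, two trivial stand-in learners replace the forests; they are not random forests and are not part of gcForest. All class and function names are hypothetical. The sketch also omits multi-grained scanning and the cross-validated generation of level outputs that the gcForest method uses to limit over-fitting.

```python
class MeanRegressor:
    """Stand-in learner (not a forest): always predicts the training mean."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class NearestNeighborRegressor:
    """Stand-in learner (not a forest): returns the target of the closest training row."""
    def fit(self, X, y):
        self.X_, self.y_ = [list(row) for row in X], list(y)
        return self

    def predict(self, X):
        def dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [self.y_[min(range(len(self.X_)), key=lambda i: dist(row, self.X_[i]))]
                for row in X]


def augment(X, outputs):
    """Concatenate the original features with the previous level's predictions."""
    return [list(row) + [out[i] for out in outputs] for i, row in enumerate(X)]


def fit_cascade(X, y, make_level, n_levels):
    """Train n_levels cascade levels; make_level() returns fresh learners for one level."""
    levels, features = [], [list(row) for row in X]
    for _ in range(n_levels):
        learners = [model.fit(features, y) for model in make_level()]
        levels.append(learners)
        features = augment(X, [model.predict(features) for model in learners])
    return levels


def predict_cascade(levels, X):
    """Pass X through every level and average the outputs of the last level."""
    features, outputs = [list(row) for row in X], []
    for learners in levels:
        outputs = [model.predict(features) for model in learners]
        features = augment(X, outputs)
    return [sum(out[i] for out in outputs) / len(outputs) for i in range(len(X))]


X = [[1.0], [2.0], [3.0], [4.0]]
y = [10.0, 20.0, 30.0, 40.0]
levels = fit_cascade(X, y, lambda: [MeanRegressor(), NearestNeighborRegressor()], n_levels=2)
predictions = predict_cascade(levels, X)
```

In this sketch the number of learners returned by `make_level` plays a role loosely analogous to n_cascadeRF, and `n_levels` is fixed rather than determined adaptively.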

Figure 2. Packages and selected parameters used for the implementation of decision tree-based models for AGB estimation in Huntington Wildlife Forest, Essex County, NY, using the combination of LiDAR and Landsat data.

RESULTS AND DISCUSSION
This section presents the results of the decision tree-based models applied to LiDAR data, Landsat 5 TM data, and the combination of the two. Table 3 summarizes the RMSE and R² of the DT, RF, and deep forest models. In addition, the AGB maps produced by each regression model are shown in Figure 3.

Table 3. Results of HWF AGB estimation using decision tree, random forest, and deep forest models and integration of LiDAR and Landsat 5 TM imagery.

As shown in Table 3, LiDAR data yield a smaller RMSE than Landsat data, which indicates the importance of the vertical structure captured by LiDAR. Although tree diameters are more closely related to AGB, height characteristics derived from LiDAR data can be used efficiently for AGB estimation (Zhao et al. 2016). The low performance of Landsat imagery is most likely due to the saturation problem. The AGB of HWF varies from 0 to 433.2 Mg/ha (Table 4), and Landsat 5 TM suffers from saturation over this range, which greatly influences the RMSE and R². Thus, in this study area, using Landsat imagery alone is not the best option for accurate AGB estimation. Combining LiDAR and Landsat 5 TM imagery decreases the RMSE, indicating an improvement in AGB estimation. This improvement arises because the integration of LiDAR and Landsat benefits from both vertical and spectral information. The performance gain from using both LiDAR and Landsat data can be seen in all three regression models. Figure 3 shows the AGB maps produced using the combination of Landsat 5 TM and LiDAR data, which provided the best results in terms of RMSE and R² for all tree-based regression models. The maximum of the AGB range is limited to 350 Mg/ha since no AGB is estimated above this value, owing to the saturation issue at high biomass. As seen in Figure 3, both deep forest and RF were capable of predicting AGB within a wider range than DT, which performed poorly at estimating low and high biomass values. Furthermore, the histograms of the three AGB maps are plotted to provide more information about the rasters (Figure 4). All three maps suffer from saturation, and none of them estimates AGB values above 340 Mg/ha. As shown in Figure 4, deep forest estimated AGB values from 0 to 340 Mg/ha, and its AGB map clearly delineates the regions with high and low biomass.
The RF model predicted AGB values from 50 to 300 Mg/ha and represents areas with low and high biomass better than the DT model. However, RF did not perform well in areas with high biomass, which are well captured by deep forest. The range of AGB estimates for DT is 110 to 230 Mg/ha, so DT performed poorly at predicting AGB in both low and high biomass regions.

Figure 4. Histograms of AGB maps for three tree-based regression models: deep forest, RF, and DT in HWF. The AGB maps were produced using decision tree-based models and the combination of LiDAR and Landsat 5 TM imagery.

CONCLUSION
The main objective of this study was to investigate the capabilities of remote sensing data and machine learning algorithms for accurate AGB estimation. The combination of LiDAR and Landsat 5 TM data with the deep forest regression model provided the most accurate AGB estimation. Combining the vertical characteristics captured by LiDAR with the spectral information derived from Landsat imagery improved the AGB prediction. Deep forest, a highly competitive alternative to DNNs, outperformed the RF and DT models. The characteristics of deep forest, in particular its cascade of diverse forests, appear to allow biomass to be predicted more accurately. For further studies, it is recommended to use other optical and radar imagery, such as Sentinel-2 and Sentinel-1 with 10 m spatial resolution, and Bayesian optimization for hyperparameter tuning.