EVALUATING A CONVOLUTIONAL NEURAL NETWORK FOR FEATURE EXTRACTION AND TREE SPECIES CLASSIFICATION USING UAV-HYPERSPECTRAL IMAGES

The classification of tree species can benefit significantly from the high spatial and spectral information acquired by unmanned aerial vehicles (UAVs), combined with advanced feature extraction and classification methods. Unlike traditional feature extraction methods, which depend heavily on the user's knowledge, the convolutional neural network (CNN)-based method can automatically learn and extract spatial features layer by layer. However, in order to capture significant features of the data, the CNN classifier requires a large number of training samples, which are rarely available when dealing with tree species in tropical forests. This study investigated the following topics concerning the classification of 14 tree species in a subtropical forest area of Southern Brazil: i) the performance of the CNN method, preceded by a step to increase and balance the sample set (data augmentation), for tree species classification as compared to the conventional machine learning methods support vector machine (SVM) and random forest (RF) using the original training data; ii) the performance of the SVM and RF classifiers when combined with a data augmentation step and spatial features extracted from a CNN. Results showed that the CNN classifier outperformed the conventional SVM and RF classifiers, reaching an overall accuracy (OA) of 84.37% and a Kappa of 0.82. The SVM and RF had poor accuracy with the original spectral bands (OA of 62.67% and 59.24%, respectively) but showed an increase of 14 to 21 percentage points in OA when combined with data augmentation and spatial features extracted from a CNN.


INTRODUCTION
Currently, one of the major challenges for conservation is to obtain reliable and accurate information at a large scale to monitor biodiversity, resources as well as the human impact on natural ecosystems (Wagner et al., 2019). Remote sensing is considered an effective means for this effort, not only because of the increased spatial and temporal resolutions of the datasets, which enable identifying elements of biodiversity, such as tree species, but also because of the increase in available data, the computational capacity to process such data, and the development of advanced classification methods (Ghosh et al., 2014; He et al., 2015; Kwok, 2018).
Small-format hyperspectral cameras on board unmanned aerial vehicles (UAVs) provide data with both high spectral and very high spatial resolution, markedly increasing the scope of remote sensing applications. UAV-borne sensors make it possible to collect data even under cloud cover. Moreover, they are flexible regarding spatial and temporal resolution, which makes them a cost-effective and operational solution for tree species classification (Nevalainen et al., 2017; Tuominen et al., 2018; Sothe et al., 2019; Miyoshi et al., 2020).
Nevertheless, one issue that must be considered when using high spatial resolution data for tree species classification is the variation in light conditions and the spectral variability within the crowns, caused by branches, lianas, background and shadows, which can negatively affect classification accuracy. To deal with this, many studies resort to textural features as a way to include spatial information in the classification process. The gray-level co-occurrence matrix (GLCM), for instance, is frequently applied for tree species classification (Franklin, Ahmed, 2017; Maschler et al., 2018; Ferreira et al., 2019; Sothe et al., 2019). However, such techniques commonly require predefined spatial filters and other parameters that are subjectively determined by the user according to their knowledge of the problem. Moreover, these spatial features are aim-specific, which means that each parameter configuration can detect only one specific type of object, making it impossible to describe all types of objects by setting empirical parameters (Zhao, Du, 2016).
Besides the data and feature extraction, a proper choice of the classification method is also decisive for a successful classification result. Machine learning algorithms, such as support vector machine (SVM) and random forest (RF), are considered robust and work well in the presence of a wide range of class distributions and with high dimensionality and multisource data (Ghosh et al., 2014), but still depend on hand-engineered features (Li et al., 2017). Recently, deep learning, a class of machine learning, has been introduced into the classification of remotely sensed images, hyperspectral images in particular. Deep learning methods are able to automatically extract high-level spatial features from hyperspectral data (Chen et al., 2016; Zhao, Du, 2016; Signoroni et al., 2019), showing great robustness and effectiveness in image classification (Chen et al., 2014; Li et al., 2017; Wagner et al., 2019). Among these methods, the convolutional neural network (CNN) is a supervised deep learning model that has been producing promising results in tree species classification (Fricker et al., 2019; Hartling et al., 2019; Sothe et al., 2019). However, few studies have explored the CNN for tree species classification in (sub)tropical forests (e.g., Sothe et al., 2019), and the potential of deep features extracted from a CNN for this purpose remains unknown.
A major challenge in the classification of tree species in tropical forests is the high number of species, some of them dominant and others rare. This results in an imbalanced sample set, with a small number of samples available for the less frequent tree species (Mellor et al., 2015). In this case, sampling the natural abundance of species would lead to highly skewed sample sizes across classes, while increasing the sample sizes of rare species would be time-consuming and costly (Graves et al., 2016). The CNN requires a large number of training samples in order to capture the essential features of the data (Pasupa, Sunhem, 2016; Yu et al., 2017), which can be an obstacle when it is used to classify a large number of tree species in tropical forests. To address this problem, a data augmentation step to increase and balance the number of training samples can be performed before the classification. Among the available data augmentation methods, flip, translation and rotation operations preserve the scene topologies of the data, which is especially important for consistent classifications, while enhancing the intra-class data diversity without incurring inter-class ambiguities (Yu et al., 2017).
In this study, a CNN method combined with a data augmentation step was tested and compared with the conventional machine learning methods SVM and RF for the classification of 14 tree species in a subtropical forest area. The potential of data augmentation and deep spatial features extracted from a CNN and incorporated into the SVM and RF classifiers was also evaluated.

Study area and samples collection
The study area is located in the municipality of Curitibanos, Santa Catarina State, Southern Brazil (Figure 1). The area covers approximately 30 ha and belongs to the Atlantic Rain Forest biome and the Mixed Ombrophilous Forest phytophysiognomy, comprising both coniferous and broadleaf species. According to the Köppen-Geiger classification, the climate is Cfb, moist mesothermal with no clearly defined dry season, with a mean annual temperature of 15 °C and a yearly rainfall of 1,616 mm (Peel et al., 2007).
Sample collection was performed after the acquisition of the hyperspectral data: crowns were selected in the images and then surveyed in the field. Eighty individual tree crowns (ITCs) representing 14 tree species were identified, corresponding to approximately 80% of the dominant tree species in the area (Table 1).

Input data
The flight was conducted in December 2017 using a quadcopter UAV (UX4 model) and a frame format hyperspectral camera based on a Fabry-Perot Interferometer (FPI), model 2015 (DT-0011). The camera has two CMOSIS CMV400 sensors and, by means of an adjustable air gap, allows the selection of up to 25 spectral bands ranging from 500 to 900 nm, with a minimum bandwidth of 10 nm at the full width at half maximum (FWHM) (Honkavaara et al., 2013) (Table 2). In the preprocessing stage, the image digital numbers were first transformed into radiance values in photon units of pixel⁻¹ s⁻¹ using the Hyperspectral Imager software developed by Rikola Ltd (2014) and supplied with the camera. The correction parameters are available in ASCII files organized according to the sensor and respective data, the positioning of the FPI and the FWHM. Next, the dark signal correction was performed using dark images collected with the lens covered prior to data capture. For the geometric processing, the camera geometry and the orientation of each band were reconstructed using the interior orientation parameters (IOPs) and the exterior orientation parameters (EOPs), estimated using the so-called on-the-job calibration, after a refinement of the initial values. The initial values for the camera positions were provided by the GNSS receiver and comprised latitude, longitude, and altitude (flight height plus the average terrain elevation). Subsequently, the coordinates of six ground control points (GCPs) (Figure 1) were added to the project and measured in the corresponding reference images. According to Miyoshi et al. (2018), frame format cameras result in more stable imaging geometries and require fewer GCPs than pushbroom sensors. After the bundle adjustment, the final errors of the GCPs (reprojection errors) were 0.03 pixels in the image and 0.003 m in the GCPs.

Afterwards, the orthorectification was performed, starting with the generation of a dense point cloud. In the last stage, the orthomosaics of all the bands were generated from the orthoimages of each hypercube band. The geometric processing and orthorectification were performed for each of the 25 spectral bands, which automatically coregisters them and thus compensates for the slight positioning differences among bands of the same image caused by the time-sequential operating principle of the camera (Honkavaara et al., 2013; Miyoshi et al., 2018). The orthomosaics of the 25 spectral bands were stacked to compose the original (or raw) dataset.

Classification
For the SVM classification, the one-against-one multiclass strategy and the radial basis function (RBF) kernel were adopted. A 5-fold cross-validation was carried out on the training sample set to tune the cost parameter, while the gamma value was set during the classification process with the function sigest of the kernlab package (Karatzoglou et al., 2004) in R (R Development Core Team, 2018). The RF classifier was run with 500 trees. The default value for the mtry parameter was kept, which corresponds to the square root of the total number of features used in each experiment (Breiman, 2001). The RF classification was conducted using the randomForest package (Liaw, Wiener, 2002) in R. The SVM and RF classifications combined with data augmentation and deep features extracted from the CNN were named the CNN-SVM and CNN-RF approaches, respectively, and used the same parameters as the conventional classifications described above.
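As a rough illustration of what these settings imply, the sketch below computes the number of pairwise binary classifiers trained under a one-against-one scheme and the default mtry value (the floor of the square root of the feature count, as used by randomForest for classification). The function names are hypothetical, and the figure of 128 deep features is an assumption based on the size of the fully-connected layer reported for the CNN.

```python
import math


def one_vs_one_classifier_count(n_classes: int) -> int:
    """Number of pairwise binary classifiers in a one-against-one scheme."""
    return n_classes * (n_classes - 1) // 2


def default_mtry(n_features: int) -> int:
    """Default mtry for classification: floor of the square root of the feature count."""
    return math.isqrt(n_features)


n_species = 14             # tree species classes in this study
n_spectral_bands = 25      # original (raw) dataset
n_deep_features = 128      # assumed: output size of the fully-connected layer

print(one_vs_one_classifier_count(n_species))  # 91
print(default_mtry(n_spectral_bands))          # 5
print(default_mtry(n_deep_features))           # 11
```

With 14 species, the one-against-one SVM therefore trains 91 binary models, and each RF split considers 5 candidate features for the raw bands.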
For the CNN, CNN-SVM and CNN-RF approaches, a data augmentation step using flip and rotation operations was executed prior to the classification. The training samples were replicated as the feature space was rotated and flipped in different directions until they reached an amount of 15,000 pixels per class, while the few species with training samples exceeding 15,000 pixels were downsampled.
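The balancing step can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: a patch is represented as a square list of rows, the helper names are hypothetical, and the random choice of variants and the seed are assumptions. The target of 15,000 samples per class comes from the text, but a small target is used here to keep the example short.

```python
import random


def rotate90(patch):
    """Rotate a 2D grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]


def flip_horizontal(patch):
    """Mirror a 2D grid left to right."""
    return [list(row[::-1]) for row in patch]


def flip_rotation_variants(patch):
    """Return the 8 flip/rotation variants of a square patch (identity included)."""
    variants = []
    current = [list(row) for row in patch]
    for _ in range(4):
        variants.append(current)
        variants.append(flip_horizontal(current))
        current = rotate90(current)
    return variants


def balance_class(patches, target, seed=0):
    """Return exactly `target` samples for one class.

    Classes above the target are randomly downsampled; classes below it
    are topped up with randomly chosen flipped/rotated copies.
    """
    rng = random.Random(seed)
    if len(patches) >= target:
        return rng.sample(patches, target)
    pool = [v for p in patches for v in flip_rotation_variants(p)]
    balanced = list(patches)
    while len(balanced) < target:
        balanced.append(rng.choice(pool))
    return balanced


rare_species = [[[1, 2], [3, 4]]]           # one toy 2x2 patch
common_species = [[[5, 6], [7, 8]]] * 30    # thirty toy patches
print(len(balance_class(rare_species, target=20)))    # 20
print(len(balance_class(common_species, target=20)))  # 20
```

Because flips and rotations only rearrange pixels, every augmented sample keeps the spectral values of its source crown; only the spatial arrangement changes.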
The feature extraction and classification using the CNN were performed using the architecture shown in Figure 2, implemented in Keras with the TensorFlow backend (Abadi et al., 2015). It consisted of five convolutional layers, three pooling layers, a fully-connected layer and a classification layer. The successive convolutional layers had 32, 32, 48, 48 and 64 kernels, respectively, and the fully-connected layer had 128 units. The learning rate was 10e-4. After every convolution operation and after the fully-connected layer, batch normalization followed by a leaky rectified linear unit (Leaky ReLU) activation function was applied. The Adam optimizer (Kingma, Ba, 2015) parameters were set to their default values. To deal with overfitting, the network was trained using early stopping and dropout regularization of 0.35 after the fully-connected layer and before the top layer. The last layer of the network (classification layer) used a softmax activation function that performs a pixel-wise classification based on the learned representative features. For the CNN-SVM and CNN-RF approaches, the learned features corresponding to the output of the fully-connected layer were used as input for the SVM and RF classifiers. The total number of CNN parameters was 186,896.
Figure 2. CNN architecture adopted in this study.
In the inference step, the trained network was applied over the image to generate the classification maps. The CNN classifier was applied to overlapping image patches to predict the class of their central pixel using a sliding window technique with a stride of 1. Next, the predictions were spatially concatenated to obtain a classification at the same resolution as the input image. After testing different window sizes, the evaluated network was designed to receive a patch of 33 × 33 pixels and to output a probability vector of size equal to the number of classes, where the index of the highest value indicates the most probable class. Figure 3 depicts the methodological flowchart of tree species classification using the CNN method and the CNN-SVM and CNN-RF approaches. It should be pointed out that, for all classifiers and approaches, the ITC samples were randomly split into training (and validation) and test sets prior to the classification. Training samples together with validation samples were used to train the classifier and to find the best classification parameters, while test samples were kept separate for the accuracy assessment.
Note (Figure 3): sp1 (yellow) and sp2 (green) represent different species; ITC = individual tree crown.
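The sliding-window inference described above can be sketched as follows. This is a simplified, single-band illustration in plain Python under stated assumptions: `classify_image` and `toy_classifier` are hypothetical names, the toy classifier stands in for the trained CNN, and border pixels are handled by edge replication, which the text does not specify. A 3 × 3 window is used in the example instead of 33 × 33 for brevity.

```python
def classify_image(image, classify_patch, window=33):
    """Label every pixel from the patch centred on it, using a stride of 1."""
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a central pixel")
    half = window // 2
    rows, cols = len(image), len(image[0])

    def pixel(r, c):
        # Assumption: replicate edge pixels so the output keeps the input size.
        r = min(max(r, 0), rows - 1)
        c = min(max(c, 0), cols - 1)
        return image[r][c]

    offsets = range(-half, half + 1)
    label_map = []
    for r in range(rows):
        row_labels = []
        for c in range(cols):
            patch = [[pixel(r + dr, c + dc) for dc in offsets] for dr in offsets]
            probs = classify_patch(patch)
            # The index of the highest probability is the predicted class.
            row_labels.append(max(range(len(probs)), key=probs.__getitem__))
        label_map.append(row_labels)
    return label_map


def toy_classifier(patch):
    """Stand-in for the trained CNN: two-class scores from the patch mean."""
    values = [v for row in patch for v in row]
    p1 = sum(values) / len(values)
    return [1.0 - p1, p1]


image = [[0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0]]
label_map = classify_image(image, toy_classifier, window=3)
print(label_map)
```

Because each label depends on a whole neighbourhood rather than on a single pixel, isolated outlier pixels tend to be absorbed by their surroundings.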

Accuracy assessment
To evaluate the classification results, confusion matrices were generated by cross-checking the classified results against the test samples, which corresponded to the 50% of ITCs not used in the training or validation steps. From the confusion matrices, the following agreement indices were calculated: (a) overall accuracy (OA); (b) precision (i.e., user's accuracy); (c) recall (i.e., producer's accuracy); (d) F-measure; and (e) Kappa index.
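For reference, these indices can be derived from a confusion matrix as in the sketch below. It assumes that rows hold the reference (test) classes and columns the predicted classes; the function name and the toy two-class matrix are illustrative only and are not results from this study.

```python
def accuracy_metrics(cm):
    """Compute OA, Kappa and per-class precision, recall and F-measure.

    cm[i][j] is the number of test samples of reference class i
    that were predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = [cm[i][i] for i in range(n)]
    reference_totals = [sum(cm[i]) for i in range(n)]
    predicted_totals = [sum(cm[i][j] for i in range(n)) for j in range(n)]

    oa = sum(correct) / total
    # Agreement expected by chance, from the row and column marginals.
    pe = sum(r * c for r, c in zip(reference_totals, predicted_totals)) / total ** 2
    kappa = (oa - pe) / (1 - pe) if pe < 1 else 0.0

    # Recall: correct / reference total; precision: correct / predicted total.
    recall = [d / r if r else 0.0 for d, r in zip(correct, reference_totals)]
    precision = [d / c if c else 0.0 for d, c in zip(correct, predicted_totals)]
    f_measure = [2 * p * r / (p + r) if (p + r) else 0.0
                 for p, r in zip(precision, recall)]
    return {"oa": oa, "kappa": kappa, "precision": precision,
            "recall": recall, "f_measure": f_measure}


toy_cm = [[8, 2],
          [1, 9]]
metrics = accuracy_metrics(toy_cm)
print(round(metrics["oa"], 2), round(metrics["kappa"], 2))  # 0.85 0.7
```

In the toy matrix, 17 of 20 samples are correct (OA = 0.85) and the chance agreement is 0.5, which gives a Kappa of 0.7.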
Although extending the tree species classification over the entire area involves uncertainties, because not all species were represented in the sample set, classified images were produced to analyze the representativeness and abundance of each species, to observe the classification patterns and to assess the agreement among the classifications produced by the different classifiers/approaches. Non-forest areas were removed with a canopy height model (CHM) mask, which excluded pixels with values below 2 m. We emphasize that extending the classification to the entire study area leads to the misclassification of tree species not surveyed in the field work. For this reason, such species were not included in the computation of the confusion matrices.
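The height-based masking can be illustrated with the short sketch below. It assumes that the classification map and the CHM are co-registered grids of the same size; the function name and the no-data value of -1 are hypothetical choices, while the 2 m threshold comes from the text.

```python
def mask_non_forest(label_map, chm, min_height=2.0, nodata=-1):
    """Replace the label of every pixel whose canopy height is below min_height."""
    return [
        [label if height >= min_height else nodata
         for label, height in zip(label_row, height_row)]
        for label_row, height_row in zip(label_map, chm)
    ]


label_map = [[3, 3, 7],
             [3, 7, 7]]
chm = [[12.5, 0.4, 18.0],   # canopy heights in metres
       [1.9, 2.0, 9.3]]
print(mask_non_forest(label_map, chm))  # [[3, -1, 7], [-1, 7, 7]]
```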

RESULTS AND DISCUSSION
Results showed that the CNN classifier outperformed the SVM and RF classifiers (Table 3), reaching an OA of 84.37% and a Kappa of 0.82. The SVM and RF had poor accuracy when only the original spectral bands were used (OA of 62.67% and 59.24%, respectively), but their OA increased by 14 and 21 percentage points, respectively, when data augmentation and deep features extracted from the CNN were employed by these classifiers.
Table 3. Tree species classification results.
Figure 4 shows the F-measure of each tree species and classifier/approach. In general, the CNN achieved the best classification for most species, except for Campomanesia xanthocarpa and Schinus sp2, for which the CNN-RF approach performed slightly better. When using only the original bands, the SVM and RF performed poorly for most species, showing that the process of increasing and balancing the sample set and the extraction of deep CNN features were crucial for improving their results. The use of the CNN classifier as well as the CNN-SVM and CNN-RF approaches proved to be particularly relevant for increasing the accuracy of species with fewer pixel samples, as in the case of Cinnamodendron dinisii, Cupania vernalis, Nectandra megapotamica, Podocarpus lambertii, Schinus sp1 and Schinus sp2, as shown in the confusion matrices (Figure 5). For these species in particular, the data augmentation process significantly increased accuracy. Yu et al. (2017) observed that models trained with augmentation operations outperformed the same deep model architecture trained on the original sample set. According to the authors, the diversity and completeness of remote sensing data can be greatly enhanced by data augmentation.
Figure 5. Confusion matrices of RF classifier (top) and CNN-RF approach (bottom). Note: Species ID according to Table 1.
The spatial information extracted from the CNN may also have been decisive in improving the accuracy of some classes. Nectandra megapotamica, for instance, is a species with small crowns; therefore, even with eight ITC samples, it has a small number of pixel samples. This may result in a higher intraclass variability, which was better captured by the spatial information extracted from the CNN. Zhao and Du (2016) also highlighted the potential of the CNN as a method to extract spatial features for land cover classification. They reported that the CNN had better results when compared with conventional feature extraction methods, such as principal component analysis and linear discriminant analysis. Maschler et al. (2018) observed that the inclusion of texture information derived from GLCM did not significantly increase the accuracy of tree species classification when using SVM or RF methods. Zhao and Du (2016) noticed that the method involving spatial features extracted from a CNN produced maps with less "salt and pepper" effect, which was also observed in this study (Figure 6). The CNN classifier (Figure 6b) produced a more homogeneous map than RF (Figure 6a) and SVM due to the use of deep spatial features. This is an important issue, because in maps with the "salt and pepper" effect it is very common for an ITC of a single species to be composed of pixels classified as different species. This effect can also be minimized by adopting a segmentation method to delineate the crowns prior to the classification. However, for tropical forests, this task is very time-consuming and usually involves many steps (e.g., Tochon et al., 2015; Wagner et al., 2018). In fact, a study conducted by Sothe et al. (2019) reported that the SVM only reached accuracies similar to those of the CNN when 3D information was incorporated into the dataset and the final classification was aggregated into segments. They also pointed out the use of a CNN for tree species classification as a promising way to extract spatial features while dealing with the intraclass variability of high spatial resolution images, without the need for a segmentation procedure.

CONCLUSION
This study investigated the ability of a CNN combined with a data augmentation process as a feature extraction and classification method for classifying 14 tree species in a subtropical forest by means of UAV-borne hyperspectral data. Results showed that the CNN outperformed the conventional machine learning methods SVM and RF, with an OA of 84.37% and a Kappa index of 0.82. SVM and RF showed a marked increase in accuracy when combined with a data augmentation step and deep spatial features extracted from a CNN. With an OA of 76.83% and 80.73%, respectively, SVM and RF were about 14 and 21 percentage points more accurate than the classification using only the original training sample set and spectral data. In addition to not requiring hand-engineered features, the CNN stood out for producing more homogeneous tree species maps without the need for a previous segmentation step, which is usually imperative when dealing with very high spatial resolution images.