DEEP MULTITASK LEARNING FOR TREE GENERA CLASSIFICATION

The goal of this paper is to classify tree genera from airborne Light Detection and Ranging (LiDAR) data using a Convolutional Neural Network (CNN) implemented as a Multi-task Network (MTN). Unlike a Single-task Network (STN), in which only one task is assigned to the learning outcome, an MTN is a deep learning architecture that learns a main task (classification of tree genera) simultaneously with other tasks (in our study, classification of coniferous and deciduous trees) using shared classification features. The main contribution of this paper is the improvement in classification accuracy from CNN-STN to CNN-MTN. This is achieved by introducing a concurrence loss ($L_{cn}$) to the designed MTN. This term regulates the overall network performance by minimizing the inconsistencies between the two tasks. Results show that the classification accuracy increases from 88.7% (STN) to 91.0% (MTN). The second goal of this paper is to solve the problem of small training sample size through multiple-view data generation. The motivation is to address one of the most common problems in implementing deep learning architectures: an insufficient amount of training data. We address this problem by simulating the training dataset with a multiple-view approach. The promising results of this paper provide a basis for classifying larger datasets with a larger number of classes in the future.


INTRODUCTION
The use of airborne LiDAR for tree species classification has proven effective, with high success rates. Compared with traditional aerial imagery, a new set of crown structural variables can be derived from the three-dimensional (3D) data, and a generation of studies has evolved based on this technology. Numerous studies have used LiDAR only (e.g. Lim et al., 2003; Holmgren and Persson, 2004; Brandtberg, 2007; Yao et al., 2012) or a combination of LiDAR and spectral signatures for tree species classification (e.g. Holmgren et al., 2008; Jones et al., 2010; Liu et al., 2011; Alonzo et al., 2014) with promising accuracy. Internal tree crown geometry metrics, such as branching structures, and external tree crown geometry metrics, such as the overall shape of the tree crown, have also been taken into consideration in the design of classification features (Ko et al., 2013; Blomley et al., 2017). With increasing computing power, lower sensor costs and more complex algorithms, LiDAR data are becoming more readily available, and the field of research has grown significantly in recent years. However, these studies rely on the derivation of hand-crafted classification features, which requires a lot of human intervention. Also, hand-crafted classification features are sensitive to local changes: a classification that works well in one study area may not work well in another.
In this paper, we propose an efficient approach to classifying tree genera from LiDAR data using deep learning. One of its advantages is that there is no need to design hand-crafted classification features. This way, when the number of classes increases in the future, the classification features do not need to be redesigned, even when the geographical location changes. Often, the same tree genera (or species) growing in different environments appear differently and may require additional classification features when new datasets are added. Trees of different ages may also appear differently; these are common challenges in tree genera classification. A universal set of classification features for all tree genera (species) is almost impossible to design. With deep learning, however, classification features can be learnt from the representation of the dataset. Also, rather than classifying tree genera as a single objective (or single task), we assigned two tasks to the classification objective (coniferous-deciduous classification and genera classification). Intuitively, trees of the same genus share the same classification features, but we also observe that trees belonging to the same group, coniferous or deciduous, share common classification features within the group. We propose a new way of classifying tree genera that has two advancements. The first is the use of a Multi-task Network, and the second is the introduction of a constraint term (the concurrence loss, $L_{cn}$) between the tasks to improve overall accuracy.

Deep Convolution Networks
The field of classification (and object recognition) has a very long history. It falls within the wide discipline of Artificial Intelligence (AI) (Goodfellow et al., 2016) and has been applied to many domains of science, such as speech recognition and computer vision (LeCun et al., 2015). In environmental science, the use of remotely sensed imagery for classifying natural objects falls into this AI category, and there exists a wide range of literature that discusses methodologies for supervised learning with various machine learning methods.
More recently, deep learning, a subset of machine learning methods, has been actively researched and applied in different fields. Deep learning has attracted much attention because it has dramatically improved classification accuracy (LeCun et al., 2015). It falls into the category of representation learning, where classification features are derived from the representation of the dataset itself (Goodfellow et al., 2016, Chapter 15). This is in contrast to classical feature learning, where classification features are hand-crafted and designed according to the nature of the raw dataset, an approach that often requires prior knowledge of the raw data. Benchmark examples of deep learning research include LeNet-5 (LeCun et al., 1998), ImageNet (Deng et al., 2009), GoogLeNet (Szegedy et al., 2015) and Microsoft ResNet (He et al., 2016). Not needing to derive features is the major advantage of representation learning over feature learning; it is a more automatic approach that requires less prior knowledge of the dataset. Some of the earlier deep learning models originated from artificial neural networks (ANNs) and are analogous to the neural networks of the brain: learning is broken down into layers (input layer, hidden layers and output layer), and the learning model is an assembly of inter-connected nodes (called neurons) and weighted links. The term "deep" in deep learning refers to a neural network composed of multiple hidden layers. The Convolutional Neural Network (CNN) is one of the most popular feed-forward deep learning architectures because of its high accuracy. As a result, we adopted a CNN architecture for our tasks. Although a CNN can be implemented in 3D space, we limit our experiments to 2D space in this study; 3D convolution will be addressed in a future study. Several approaches that adapt CNNs to 3D data can be found in Boulch et al. (2017), Huang and You (2016) and Lawin et al. (2017).

Multiple-view Data Generation
One of the most challenging problems in deep learning (or any learning problem) is the limited amount of training data. A large pre-labeled dataset is required to train the network, and the acquisition of such pre-labeled data is often costly. Data augmentation refers to the generation of new data by applying transformations (e.g. scaling, rotating and translating) to the original data. Its aim is to increase the amount of training data, and it has been proven effective in reducing the well-known overfitting problem in image classification (Krizhevsky et al., 2012). Data augmentation also addresses the problem of imbalanced training data and can improve CNN-based methods (Chatfield et al., 2014). In 3D, a particular form of data augmentation is to generate a set of 2D images by viewing an object from different angles, called multi-view data generation (Su et al., 2015). We transformed the 3D LiDAR trees into 2D image space in order to use the existing tools already developed for deep learning algorithms.

Multi-task Network (MTN) for Tree Genera
MTN has been applied successfully in different areas, such as computer vision and drug discovery (Ramsundar et al., 2015; Girshick, 2015). As the name suggests, an MTN learns more than one task at the same time with shared classification features. In our study, the two tasks are coniferous-deciduous classification and genera classification.
In general, when multiple tasks are performed simultaneously, the classification features derived by the classifier are less likely to be tailored to one single task and can therefore perform better by avoiding overfitting. According to Ruder (2017), MTN is successful because learning from different tasks averages the noise patterns observed in the individual tasks. Also, where the data are very noisy, the relevance of features derived from the first task can be verified by the other tasks. Moreover, the network can learn certain features through a task from which they are easier to learn, and the additional task introduces an inductive bias. Thus, MTN can reduce the risk of overfitting, resulting in better classification accuracy.

STUDY AREA AND MATERIAL
Airborne LiDAR has been useful in studying trees, whether for tree height estimation (Nilsson, 1996), retrieval of biophysical variables (Popescu et al., 2004), or species classification (Yao et al., 2012). The LiDAR data used in this paper were acquired on 7 August 2009 with a Riegl LMS-Q560; the study area is approximately 75 km east of Sault Ste. Marie, Ontario, Canada. The average point density is approximately 40 points per m². As pre-processing, 186 individual tree crowns were manually detected and segmented from the LiDAR scene before field validation. Field surveys were conducted in the summers of 2009 and 2011. There are eight field sites, chosen to capture the diversity of environmental conditions. Seven of the sites belong to a forested area, and one site was chosen along both sides of a section of a transmission corridor. Of the 186 tree samples, 160 belong to the genera Pinus (pine), 67 samples; Populus (poplar), 59 samples; and Acer (maple), 34 samples. A detailed description of the LiDAR dataset can be found in our previous work (Ko et al., 2013). In this paper, we focus on the classification of these three classes. Figure 1 shows five example images for each genus: pine (top row), maple (middle row) and poplar (bottom row). The greyscale of the LiDAR points represents northing (m) in UTM.

METHODS
We separate our methods into four sections. Section 3.1 describes the preparation of the input data and the method of multi-view data generation. Section 3.2 describes the splitting of the training and testing datasets. Sections 3.3 and 3.4 describe the implementation of STN and MTN, respectively. STN classifies the trees into genera, while MTN contains two tasks: the first classifies the trees into genera and the second classifies the trees as coniferous or deciduous, using the STN described in section 3.3 as the base network.

Multi-view LiDAR Tree Image Generation
For each segmented LiDAR tree, we produce 64 2-channel 2D images. The first channel (C1) of each image records the location of each detected LiDAR point, and the second channel (C2) records the actual height of the tree. The first channel is inspired by the depth images widely used in computer graphics, where the depth channel records the distance between the object and the viewpoint. Similarly, we produced C1 by projecting each tree crown onto UTM easting (x) and height (z). Each discrete LiDAR reflection is drawn as a circle whose grey value (8-bit) represents the UTM northing (y). Each image has an x-to-z aspect ratio of 1:1, and the size of the image is fixed by adding a border around each image, resulting in a 500 × 500 pixel image. This implies that smaller (younger) trees and bigger (older) trees occupy the image in the same way. Since C1 is normalized to the size of the image, we introduce the second channel (C2), in which the height above the lowest point of the tree crown (m) is recorded as the pixel value. This way, two trees that look similar in C1 but differ in size will appear differently in C2; that is, the absolute size of the tree crown is preserved in C2. The images are then stored as 2-channel images. For illustration purposes, C1 is represented by red and C2 by green. Figure 2(a) shows an example of a LiDAR pine tree projected onto the x-z plane. Each segmented tree crown has a local coordinate system whose origin (0, 0) is located at the centre bottom of the tree crown. The grey scale represents the location of LiDAR points up to 3 m out of the paper (+3 m) or 3 m into the paper (-3 m); this leads to the generation of C1, shown in Figure 2(c). Figure 2(b) shows the same pine tree on the same axes, where the grey scale represents the height above the lowest point of the tree crown. Note that the highest point recorded for this tree is 12.3 m. We then rotate the tree through 64 different angles about the vertical axis through the x-y centroid of the segmented tree crown. The angle of rotation is $2\pi n/64$, where $n = 1, \dots, 64$; the projection plane remains the same, with x representing the rotated UTM easting and z representing the height. Figure 3 shows an example of the 64 images generated from a single pine tree. The top-down view of the tree is included to indicate the 64 angles of rotation about the tree centre. This process is repeated for all 160 trees.
Figure 3. The generation of 64 images from a single tree; the example shows the results generated for C1
As a result of this method, we generate 10,240 2-channel images (160 trees × 64 views) for training and testing. The 64 C2 images are generated in the same way and are then combined with the C1 images into 2-channel images as described above.
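The multi-view generation described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: each LiDAR point is drawn as a single pixel rather than a circle, the border width and the mapping of northing to grey values 1 to 255 are arbitrary choices, and the function names (`rotate_about_centroid`, `render_view`, `multi_view_images`) are hypothetical.

```python
import math


def rotate_about_centroid(points, angle):
    """Rotate (x, y, z) points about the vertical axis through the x-y centroid.

    The returned coordinates are local: x and y are relative to the centroid.
    """
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * (x - cx) - s * (y - cy), s * (x - cx) + c * (y - cy), z)
        for x, y, z in points
    ]


def render_view(points, size=500, border=10):
    """Project points onto the x-z plane and return the two channels (C1, C2).

    C1 stores an 8-bit grey value derived from y (depth); C2 stores the height
    in metres above the lowest point. Assumption: one pixel per point.
    """
    xs, ys, zs = zip(*points)
    x_min, y_min, z_min = min(xs), min(ys), min(zs)
    x_span, z_span = max(xs) - x_min, max(zs) - z_min
    y_span = (max(ys) - y_min) or 1.0
    # A single scale for both axes keeps the x:z aspect ratio at 1:1.
    extent = max(x_span, z_span) or 1.0
    scale = (size - 1 - 2 * border) / extent
    x_pad = (extent - x_span) / 2.0  # centre the crown horizontally
    c1 = [[0] * size for _ in range(size)]
    c2 = [[0.0] * size for _ in range(size)]
    for x, y, z in points:
        col = border + round((x - x_min + x_pad) * scale)
        row = size - 1 - border - round((z - z_min) * scale)  # row 0 is the top
        grey = 1 + round(254 * (y - y_min) / y_span)  # 0 is kept for background
        c1[row][col] = max(c1[row][col], grey)
        c2[row][col] = max(c2[row][col], z - z_min)
    return c1, c2


def multi_view_images(points, n_views=64, size=500, border=10):
    """Yield one (C1, C2) pair per rotation angle 2*pi*n/n_views, n = 1..n_views."""
    for n in range(1, n_views + 1):
        rotated = rotate_about_centroid(points, 2.0 * math.pi * n / n_views)
        yield render_view(rotated, size=size, border=border)
```

Because `multi_view_images` is a generator, the 64 views of a tree can be written to disk one at a time instead of being held in memory together.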

Training Data and Testing Data
We split the data into two categories: training data (25%) and testing data (75%). We chose this partition to allow comparison with our previous study (Ko et al., 2013). From the 160 trees collected, we randomly selected 40 trees for training; when a tree is selected, its 64 associated images are also assigned to the training set. In the case of MTN, the same training and testing images are used for both tasks.
Table 1. Training and testing data partition for the experiments (M = Maple; Po = Poplar; P = Pine; D = Deciduous; C = Coniferous)
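The key point of this partition is that trees, not images, are split, so that all 64 views of a tree end up on the same side. A minimal sketch follows, assuming integer tree identifiers and a fixed random seed; the function name `split_by_tree` is hypothetical, and the sketch does not stratify by genus, since the per-genus counts of Table 1 are not reproduced here.

```python
import random


def split_by_tree(tree_ids, train_fraction=0.25, n_views=64, seed=0):
    """Split at tree level and return lists of (tree_id, view_index) pairs."""
    ids = sorted(tree_ids)
    random.Random(seed).shuffle(ids)  # seeded for a reproducible partition
    n_train = round(len(ids) * train_fraction)
    train_trees, test_trees = ids[:n_train], ids[n_train:]
    # Every view of a tree follows the tree into its partition.
    train = [(t, v) for t in train_trees for v in range(n_views)]
    test = [(t, v) for t in test_trees for v in range(n_views)]
    return train, test
```

With 160 trees this gives 40 training trees (2,560 images) and 120 testing trees (7,680 images), and no tree contributes images to both sets.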

Single-task Network (STN)
Motivated by LeNet-5 (LeCun et al., 1998), we designed a base network for tree genera classification, as shown in Figure 4. The base network consists of three Conv layers (Conv1, Conv2 and Conv3) and two fully connected layers (Fc1 and Fc2), where each Conv layer is composed of a series of layers: convolution, batch normalization, Rectified Linear Unit (ReLU), max pooling and dropout. Since this base network has a single task, it is considered a Single-task Network (STN). Note that in our experiment, the dimension of the input data is 48 pixels × 48 pixels × 2 channels.

Figure 4. Summary of STN network
When the convolution operation is applied to the input with a filter, the result is called the convolution layer. The two common parameters for this layer are the filter size and the number of filters. In our study, the size of the filters is chosen based on examples such as LeNet-5 (LeCun et al., 1998), where the authors use 32 × 32 pixel images and the first convolution operation uses a 5 × 5 filter. In AlexNet (Krizhevsky et al., 2012) …

$$L_{ge}(W) = \frac{1}{N} \sum_{i=1}^{N} \ell(x_i, y_{ge,i}; W) \qquad (1)$$

where $\ell(x_i, y_{ge,i}; W)$ is the cross-entropy between the predicted label probability and the field-validated label, and $W$ denotes the weights in the network.

Multi-task Network (MTN)
Inspired by the work of Ruder (2017) and Liao et al. (2017), we extended the STN into an MTN. In order to benefit from both tree genera classification and coniferous-deciduous classification, the proposed multi-task network is built to perform two tasks: 1) tree genera classification (major task), and 2) coniferous-deciduous classification (auxiliary task). Similar to the STN described in section 3.3, the MTN has three convolution layers (Conv1, Conv2 and Conv3); the difference is that the network then splits into separate fully-connected layers that generate the outcome for each task. A summary of the MTN is shown in Figure 5. The three convolution layers (Conv1, Conv2 and Conv3) are shared between the two tasks, with the same settings as the STN described in Figure 4. In addition to the genera loss $L_{ge}$ used in the STN, the MTN is trained with two further losses. The first, $L_{cd}$, is the loss for the coniferous-deciduous binary classification (auxiliary task). It is defined as the average cross-entropy between the predicted probabilities for the classification and the field-validated labels. The second is $L_{cn}$, which we name the concurrence loss. The purpose of $L_{cn}$ is to minimize the inconsistencies between the outcomes of the two tasks. We achieve this by transforming the outcome of the tree genera classification (major task) into coniferous-deciduous probabilities through the function $f_{ge2cd}$ (Figure 5). We then compare these derived probabilities with the probabilities obtained by the auxiliary task. The objective of the network is to minimize the difference by introducing $L_{cn}$, the average cross-entropy between these derived probabilities and the probabilities from the auxiliary task.
The proposed MTN is trained to minimize a total loss that comprises three losses, $L_{ge}(W)$, $L_{cd}(W)$ and $L_{cn}(W)$ (Equations 2 and 3), where $p_{cd,i}$ is the two-dimensional vector (one element is for coniferous and the other is for deciduous) indicating the softmax output (predicted probability) for the binary classification of the $i$th data.
Both STN and MTN were implemented using TensorFlow Ver. 1.3 (Abadi et al., 2016) on an NVIDIA GeForce GTX 1080 Ti. We used Adam optimization (Kingma et al., 2014) with a learning rate of $5.0 \times 10^{-5}$, dropout at 0.7 and a mini-batch size of 100 for both networks.
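To make the role of the concurrence loss concrete, the three MTN losses can be sketched in standard-library Python operating on softmax outputs. Several choices here are assumptions rather than details confirmed by the text: the class order in `GENERA`, the direction of the cross-entropy in the concurrence term (derived probabilities as the target), and the unweighted sum used for the total loss. The helper names other than `f_ge2cd` are hypothetical.

```python
import math

# Assumed ordering of the genera softmax output (not stated in the text).
GENERA = ("pine", "poplar", "maple")
CONIFEROUS = {"pine"}  # pine is coniferous; poplar and maple are deciduous


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


def f_ge2cd(p_ge):
    """Collapse genera probabilities into (coniferous, deciduous) probabilities."""
    p_con = sum(p for g, p in zip(GENERA, p_ge) if g in CONIFEROUS)
    p_dec = sum(p for g, p in zip(GENERA, p_ge) if g not in CONIFEROUS)
    return (p_con, p_dec)


def mtn_losses(p_ge, p_cd, y_ge, y_cd):
    """Return (L_ge, L_cd, L_cn, total), each averaged over the batch.

    p_ge, p_cd: per-sample softmax outputs of the two tasks.
    y_ge, y_cd: per-sample one-hot field-validated labels.
    """
    n = len(p_ge)
    l_ge = sum(cross_entropy(y, p) for y, p in zip(y_ge, p_ge)) / n
    l_cd = sum(cross_entropy(y, p) for y, p in zip(y_cd, p_cd)) / n
    # Concurrence: derived coniferous-deciduous probabilities vs. auxiliary output.
    l_cn = sum(cross_entropy(f_ge2cd(g), c) for g, c in zip(p_ge, p_cd)) / n
    return l_ge, l_cd, l_cn, l_ge + l_cd + l_cn  # assumption: unweighted sum
```

For a genera output of (0.7, 0.2, 0.1), the derived coniferous-deciduous probabilities are (0.7, 0.3); an auxiliary output close to these values gives a smaller concurrence term than one that contradicts them, which is the behaviour the constraint is meant to reward.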

RESULTS AND DISCUSSION
We performed the classification with both STN and MTN and provide the confusion matrices in section 4.2. The overall genera classification accuracy is 88.7% for STN and 91.0% for MTN. The improved classification accuracy indicates the potential of MTN for larger amounts of data and a larger number of genera in the future. To illustrate the performance of MTN, section 4.1 shows the losses over epochs for MTN, and section 4.2 shows the classification performance of STN and MTN with confusion matrices.

Loss over Epoch for Multi-task Network
Previous research by Krizhevsky et al. (2012) has shown that an STN converges to a minimum loss as the number of epochs increases. One of our goals is to investigate whether MTN behaves in the same way. To illustrate the performance of MTN, Figure 6 shows the loss functions over epoch.
Figure 6. The total loss and the losses $L_{ge}$, $L_{cd}$ and $L_{cn}$ over epoch
From the figure, we can see that all four loss functions have converged by epoch 300. Also, the loss for genera classification is higher than that for coniferous-deciduous classification, meaning that the binary classification generally has higher accuracy (it is a relatively easier task). In addition, $L_{cn}$ decreases as the epoch increases, meaning that the inconsistencies between the predictions of the two tasks also decrease.

Confusion matrices for STN and MTN
To better understand the classification results, we present the confusion matrices of STN and MTN for both genera classification and coniferous-deciduous classification. Table 2 shows the confusion matrix for genera classification for STN, and Table 3 shows the confusion matrix for coniferous-deciduous classification for STN. Although the goal of STN is genera classification, we produce the results in Table 3 to show that classification accuracy improves for both our prime goal (genera classification) and the secondary goal (coniferous-deciduous classification). Table 4 shows the confusion matrix for genera classification for MTN, and Table 5 shows the confusion matrix for coniferous-deciduous classification for MTN. By comparing Table 2 and Table 4, we can see that the overall classification accuracy for genera increases from 88.7% to 91.0%. From STN to MTN, the omission error of pine decreases from 23% to 16% and the commission error of poplar decreases from 18% to 14%. Although the classification of coniferous and deciduous trees is not our prime goal, its accuracy also increases from STN to MTN (89.7% to 91.9%). The omission error of coniferous trees decreases from 23% to 16%, and the commission error of deciduous trees decreases from 14% to 10%.
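The overall accuracy, omission error and commission error quoted above follow directly from a confusion matrix. The sketch below assumes the usual remote-sensing convention that rows hold the reference (field-validated) classes and columns hold the predicted classes; the function name `accuracy_and_errors` is hypothetical, and the counts in the usage note are invented for illustration and are not the values of Tables 2 to 5.

```python
def accuracy_and_errors(matrix, labels):
    """Return (overall_accuracy, per-class omission and commission errors).

    matrix[i][j] is the number of samples of reference class i predicted as j.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    report = {}
    for i, name in enumerate(labels):
        row_total = sum(matrix[i])                  # all reference samples of class i
        col_total = sum(row[i] for row in matrix)   # all predictions of class i
        report[name] = {
            # Share of reference samples of this class that were missed.
            "omission": 1 - matrix[i][i] / row_total if row_total else 0.0,
            # Share of predictions of this class that were wrong.
            "commission": 1 - matrix[i][i] / col_total if col_total else 0.0,
        }
    return correct / total, report
```

For example, `accuracy_and_errors([[8, 2], [1, 9]], ("a", "b"))` gives an overall accuracy of 17/20, an omission error of 2/10 for class "a" and a commission error of 1/9 for class "a".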

The results in Tables 2 to 5 offer some insights. STN derives classification features that are designed explicitly for genera classification, while MTN derives shared classification features for both tasks. The shared features are more generalized and may have reduced some of the overfitting seen with STN, which would explain the higher classification accuracy (Table 2 and Table 4). The improved accuracy for the coniferous-deciduous classification (Table 3 and Table 5) shows that the proposed loss function $L_{cn}$ plays an important role in constraining the inconsistencies between the two tasks in the MTN. We think this constraint is particularly successful because of the hierarchical nature of tree genera classification.

CONCLUSION
Our paper has two major conclusions. The first concerns the use of existing CNN tools for processing 3D LiDAR data in 2D space. The advantage of this approach is that the classification features are derived automatically by the network, without human intervention. We overcame the problem of insufficient training data by generating additional data through multi-view data augmentation. The second conclusion is drawn from the successful results obtained with the Multi-task Network (MTN): the genera classification accuracy increased from 88.7% to 91.0%. The introduction of the constraint term $L_{cn}$ proved useful in improving classification accuracy. In the near future, we will obtain more LiDAR trees with a larger number of classes, and we would like to process a more complex dataset with the same MTN. This paper provides a strong foundation for that future work.

Figure 1. Examples of segmented LiDAR tree crowns. The top row shows five examples of pine trees, the middle row shows five examples of maple trees and the bottom row shows five examples of poplar trees. The greyscale of the points represents UTM northing (m).

Figure 2. Example of a pine LiDAR tree projected onto the x-z plane. (a) shows the result when the origin represents the x-y centroid of the tree and the lowest recorded LiDAR point is at zero; the grey value represents the location of the LiDAR point out of the paper (+) or into the paper (-). (b) shows the same tree on the same axes, where the grey scale represents height above the lowest recorded LiDAR point. (c) and (d) are the images produced from (a) and (b), respectively. (e) is the result of combining (c) and (d) into a 2-channel image, where (c) represents the red channel and (d) represents the green channel.

Figure 5. Summary of MTN network

On top of the loss $L_{ge}$ introduced in the STN, we propose two additional losses for training the MTN. The first is $L_{cd}$, the loss for the coniferous-deciduous binary classification (auxiliary task).

The Rectified Linear Unit (ReLU) sets negative values to zero. As a result, the stack of images contains only non-negative numbers. Dropout is a regularization technique for reducing overfitting in neural networks. It randomly drops units (along with their connections) from the neural network during training. The max pooling layer reduces the size of the stacked images from the previous layer, where max pooling refers to recording the maximum pixel value within the window. The pooled images become the input of the next convolution layer. A fully-connected layer is one with dense, or full, connections between the nodes of two layers. The fully-connected layers (Fc1 and Fc2) are layers in which the feature values vote for the different classes in the classification. In the last step, the softmax layer produces probabilities for the genera labels.

We use the following equation to calculate the loss function $L_{ge}(W)$. Let $(x_i, y_{ge,i})$, $i \in \{1, \dots, N\}$, be the $i$th data in the training set, where $x_i$ is the $i$th input data (48 pixels × 48 pixels × 2 channels) and $y_{ge,i} \in \{1, \dots, M\}$ is the label for $x_i$, in the form of a one-hot vector, for the tree genera. The base network is trained to minimize the loss $L_{ge}(W)$ (Equation 1).

For the MTN losses, $\ell(f_{ge2cd}(p_{ge,i}), p_{cd,i}; W)$ is the cross-entropy between $f_{ge2cd}(p_{ge,i})$ and $p_{cd,i}$; $p_{ge,i}$ is the $M$-dimensional vector indicating the softmax output (predicted probability) for the tree genera classification of the $i$th data; and $p_{ge,i}(m)$ is the $m$th element of $p_{ge,i}$, e.g. if $m = 1$ …

Table 2. Confusion matrix for genera classification for STN

Table 4. Confusion matrix for genera classification for MTN with use of $L_{cn}$

Table 5. Confusion matrix for coniferous-deciduous classification for MTN with use of $L_{cn}$