COASTAL HABITAT MAPPING WITH UAV MULTI-SENSOR DATA: AN EXPERIMENT AMONG DCNN-BASED APPROACHES

With the recent abundance of high-resolution multi-sensor UAV data and the rapid development of deep learning models, efficient automatic mapping using deep neural networks is becoming a common approach. However, with the ever-expanding inventories of both data and deep neural network models, it can be difficult to know which to choose. Most models expect input in the conventional RGB format, but they can be extended to incorporate multi-sensor data. In this study, we re-implement and modify three deep neural network models of various complexities, namely UNET, DeepLabv3+ and Dense Dilated Convolutions Merging Network (DDCM), to use both RGB and near infrared (NIR) data from a multi-sensor UAV dataset over a Norwegian coastal area. The dataset has been carefully annotated by marine experts for coastal habitats. We find that the NIR channel increases UNET performance significantly but has mixed effects on DeepLabv3+ and DDCM. The latter two are capable of achieving their best performance with RGB only. The class-wise evaluation shows that the NIR channel greatly increases the performance of UNET for green algae, red algae, vegetation and rock. However, the purpose of the study is not merely to compare the models or to achieve the best performance, but to gain more insight into the compatibility between various models and data types. As there is an ongoing effort to acquire and annotate more data, we aim to include them in the coming year.


INTRODUCTION
After its development for military applications, the unmanned aerial vehicle (UAV) has become a popular tool for civil applications [Pajares, 2015]. Because UAVs can operate at much lower altitudes than satellites and can be equipped with various sensors, they provide a flexible and cost-effective way to acquire large amounts of high-resolution information over an area of interest.
As part of a Norwegian infrastructure project, the goal is to establish drone-based mapping and monitoring of the coastal environment. Automated image analysis via deep learning needs to be implemented for mapping habitats, including seafloor substrate types, subsurface vegetation and other management-relevant species. This task of classifying every pixel in an image is referred to as semantic segmentation in computer vision.
The current mainstream approach is based on Deep Convolutional Neural Networks (DCNNs) [LeCun et al., 1989, Krizhevsky et al., 2012, Simonyan and Zisserman, 2015, He et al., 2016, Sandler et al., 2019], deployed in a fully convolutional fashion [Long et al., 2015]. These Fully Convolutional Networks (FCNs) replace the fully-connected layers in a classification network with convolution layers, effectively extending image-level classification to pixel-level classification.
The repeated combination of max-pooling and striding in consecutive layers of DCNNs [LeCun et al., 1989, Long et al., 2015] is a very common technique, but it is known to significantly reduce the spatial resolution of the resulting feature maps. Two different approaches have been employed to address this problem. One of them is to use transposed convolution or upsampling [Zeiler et al., 2011, Noh et al., 2015, Long et al., 2015, Ronneberger et al., 2015]. In-network upsampling has been observed [Long et al., 2015] to be fast and effective for learning dense prediction. A typical example of using upsampling is UNET [Ronneberger et al., 2015], which has a symmetric encoder-decoder architecture. The other approach is to use dilated convolution (atrous convolution) [Holschneider et al., 1990, Chen et al., 2017]. A well-known example of using atrous convolution is DeepLabv3 [Chen et al., 2018b], which has an asymmetric encoder-decoder architecture. Its Atrous Spatial Pyramid Pooling (ASPP) module uses atrous convolution and pooling operations to capture features at multiple scales. [Chen et al., 2017] further extend DeepLabv3 to DeepLabv3+ to recover detailed object boundaries by concatenating the low-level features with bilinearly upsampled features from the ASPP encoder. Instead of the parallel design of the ASPP module, [Liu et al., 2020b] present the Dense Dilated Convolutions Merging (DDCM) Network, which employs dilated convolution in a cascading structure [Yu and Koltun, 2016a]. Their model achieves good performance on the ISPRS Potsdam and Vaihingen data [Kaiser et al., 2017], as well as the DeepGlobe land cover dataset [Demir et al., 2018].
* Corresponding author: yiliu@nr.no
Due to the nature of CNN structures, the receptive field is limited to local regions [Luo et al., 2016], which adversely affects the performance of FCNs. Several approaches have been proposed to address this. Dilated convolution [Holschneider et al., 1990] is a common technique to expand the receptive field [Yu and Koltun, 2016b, Chen et al., 2017, Zhao et al., 2018, Liu et al., 2020b], but it carries a heavy overhead cost for large dilation rates. Pooling is another approach for capturing long-range dependencies: a global average pooling module is proposed in ParseNet, an atrous spatial pyramid pooling (ASPP) module based on different dilation rates in DeepLab [Chen et al., 2018a], and a pyramid pooling module (PPM) based on different regions in PSPNet [Zhao et al., 2017].
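To make the effect of dilation on the receptive field concrete, the following minimal sketch computes the receptive field of a stack of convolution layers. It uses the standard relation that a kernel of size k with dilation rate r spans k + (k - 1)(r - 1) input positions. The function names and the two example layer stacks are illustrative assumptions and do not correspond to any of the networks compared in this paper.

```python
def effective_kernel(kernel: int, dilation: int) -> int:
    """Span of a dilated kernel in input positions: k + (k - 1) * (r - 1)."""
    return kernel + (kernel - 1) * (dilation - 1)


def receptive_field(layers):
    """Receptive field of a stack of (kernel, stride, dilation) layers."""
    rf = 1    # a single output position initially sees one input pixel
    jump = 1  # distance between neighbouring positions, in input pixels
    for kernel, stride, dilation in layers:
        rf += (effective_kernel(kernel, dilation) - 1) * jump
        jump *= stride
    return rf


# Hypothetical stacks: three plain 3x3 convolutions versus three 3x3
# convolutions with dilation rates 1, 2 and 4 (same number of weights).
plain = [(3, 1, 1)] * 3
dilated = [(3, 1, rate) for rate in (1, 2, 4)]

print(receptive_field(plain), receptive_field(dilated))  # 7 15
```

With the same number of layers and weights, the dilated stack covers a window roughly twice as wide, which is the property the dilation-based models rely on.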
In terms of network structure, most models adopt an encoder-decoder structure [Badrinarayanan et al., 2017, Noh et al., 2015, Ronneberger et al., 2015], as it not only helps refine segmentation masks but also helps build contextual information.
Besides the complexity of the networks, another factor that affects the result is the type of input data. As drones are equipped with multiple sensors, there is usually extra information such as multispectral data in addition to conventional RGB. How to make full use of the available data, given the many available neural network models, is worth investigating.
In this paper, we re-implement and adapt three encoder-decoder models, UNET [Ronneberger et al., 2015], DeepLabv3+ [Chen et al., 2018a] and Dense Dilated Convolutions Merging Network (DDCM) [Liu et al., 2020b], to include multi-sensor data. In particular, we train with conventional RGB and near infrared (NIR) data, and compare with the results of models trained with RGB only. ResNet50 [He et al., 2016] is used consistently as the backbone. The models differ in their use of dilation and pooling operations, and in how low-level and high-level features are combined. We show both overall and class-wise scores for all models trained with and without the NIR data and discuss the results. In addition, an adaptive class weighting loss is implemented to account for the high imbalance of label categories.

Deep Neural Network Models
UNET
UNET [Ronneberger et al., 2015] features a symmetric encoder-decoder architecture that was originally proposed for cell segmentation in microscopy images, but has become a baseline model in remote sensing. According to its original design, the encoder (the red contracting path in Fig. 1a) consists of repeated convolutional layers (3×3, unpadded), each followed by a rectified linear unit (ReLU), and max pooling (2×2 with stride 2) for downsampling. The resulting condensed high-level feature maps are then processed in the decoder (the green expansive path in Fig. 1a). In the decoder, each level consists of bilinear upsampling, followed by a 2×2 convolution, a concatenation with the cropped feature map from the corresponding encoder level, and two 3×3 convolutions (each followed by a ReLU). In this implementation, the contracting path is replaced by the backbone ResNet50 (down to layer3).

DeepLabv3+
DeepLabv3+ [Chen et al., 2018a] features the Atrous Spatial Pyramid Pooling (ASPP) module, which combines dilated convolution and spatial pyramid pooling [Grauman and Darrell, 2005, Lazebnik et al., 2006, He et al., 2014, Zhao et al., 2017]. Dilated convolution can be viewed as a generalized convolution that modifies the filter's field-of-view through the rate value, and it has been widely used in modern convolutional neural networks. The ASPP module (denoted by the red dashed box in Fig. 1b) applies several parallel dilated convolutions with different rates and subsequently uses adaptive pooling to capture multi-scale features.
Figure 1. Network architectures: a) UNET, b) DeepLabv3+ and c) DDCM-Net. Upsample refers to bilinear upsampling. Backbone feature extraction is denoted by yellow dashed boxes. Model-specific modules are denoted by colored dashed boxes. The sizes of the feature maps are given as fractions of the original image size and reflected by the size of the planes. Numbers in green denote image scale, numbers in red denote dilation rate and numbers in black denote the number of feature channels.

Dense Dilated Convolutions Merging (DDCM) Network
DDCM [Liu et al., 2020b] is another example of achieving contextual aggregation through dilated convolution. There are two differences compared to the DeepLabv3+ architecture. One is the design of the DDCM module. In its simplest form, with only one dilation rate, it consists of a dilated convolution with the given rate, followed by PReLU [He et al., 2015] non-linear activation, batch normalization (BN) [Ioffe and Szegedy, 2015] and concatenation with the input. Given a sequence of rates, the DDCM module repeats this operation in a cascade, so the number of feature maps increases with the number of rates given. The rates are indicated by the red numbers in Fig. 1c. The final step in the DDCM module consists of a 1×1 convolution, BN and PReLU to reduce the number of output features. The other difference is that the DDCM module is applied multiple times in the network (see Fig. 1c), first directly at the image level, then twice on the high-level features from the backbone layers. MaxPool2d is used as the downsampling method, indicated by the red arrow in Fig. 1c. The (1/2)x size feature maps extracted by DDCM in the two branches are concatenated as input to a 3×3 convolution layer, before bilinear interpolation is applied to recover the full resolution (denoted by the blue dashed arrow as "classify and upsample" in Fig. 1).
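The growth in the number of feature maps inside one DDCM module can be followed with a small bookkeeping sketch. It assumes that every dilated convolution sees all features stacked so far and emits a fixed number of new channels (`growth`, defaulting to the module's input width). That choice, the function name and the example numbers are illustrative assumptions, not values taken from the text.

```python
def ddcm_channel_plan(in_channels, rates, out_channels, growth=None):
    """List the channel counts seen by each step of one DDCM module.

    Assumption: each dilated convolution emits `growth` channels, which are
    concatenated with the features that entered that step.
    """
    growth = in_channels if growth is None else growth
    plan = []
    stacked = in_channels
    for rate in rates:
        # dilated conv -> PReLU -> BN, then concatenation with its input
        plan.append({"rate": rate, "conv_in": stacked, "conv_out": growth})
        stacked += growth
    # final 1x1 convolution (+ BN + PReLU) merges the stack to out_channels
    plan.append({"rate": None, "conv_in": stacked, "conv_out": out_channels})
    return plan


# Hypothetical module: 3 input channels, rates (1, 2, 3), 8 merged outputs.
plan = ddcm_channel_plan(in_channels=3, rates=(1, 2, 3), out_channels=8)
for step in plan:
    print(step)
```

The input width of the merging 1×1 convolution grows linearly with the number of rates, which is why the module ends with a channel-reducing step.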

Multi-channel input modification
To make the models general for UAV multi-sensor data, the number of input channels in the backbone is modified. In addition, to benefit from a pre-trained network, the initial weights of the first three channels (RGB) are kept and copied to initialize the extra channel.
For the experiment in this study, all three models are implemented based on the architectures illustrated in Fig. 1, with a pre-trained ResNet50 as backbone, and trained using consistent protocols for comparison.
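The input modification described above can be sketched on the weight array of the first convolution. The text does not state which RGB filter is copied into the extra channel, so the `source_channel` argument (red by default) is an assumption, and a random array stands in for the pre-trained ResNet50 weights of shape (64, 3, 7, 7).

```python
import numpy as np


def expand_input_filters(weight, extra_channels=1, source_channel=0):
    """Extend (out, 3, kH, kW) filters to (out, 3 + extra_channels, kH, kW).

    The RGB filters are kept unchanged; each extra input channel starts as a
    copy of the filter at `source_channel` (an assumed choice).
    """
    source = weight[:, source_channel:source_channel + 1]
    extra = np.repeat(source, extra_channels, axis=1)
    return np.concatenate([weight, extra], axis=1)


rng = np.random.default_rng(0)
# Stand-in for the pre-trained first-layer weights of a ResNet50.
rgb_weight = rng.normal(size=(64, 3, 7, 7)).astype(np.float32)
rgbn_weight = expand_input_filters(rgb_weight)  # RGB + NIR

print(rgbn_weight.shape)  # (64, 4, 7, 7)
```

In a PyTorch model, the expanded array would be loaded into a new first convolution with four input channels, while all other backbone layers keep their pre-trained weights.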

Adaptive Class Weighting Loss
Main loss function
As label classes are highly imbalanced in this dataset, obtaining meaningful weight updates for the minority classes needs to be addressed for efficient training. We address this problem by using an adaptive class weighting loss: a customized loss function [Liu et al., 2020a] that uses a median frequency [Eigen and Fergus, 2015] class weighting method based on iterative batch-wise class rectification [Kampffmeyer et al., 2016]. The total loss function is formulated as a combination of a positive and negative class balance (PNC) function $L_{pnc}$ and a dice loss $L_{dice}$ [Milletari et al., 2016], where $w_{i,j}$ is the pixel-wise adaptive class weight for the i-th pixel of the j-th class, N the total number of pixels, C the total number of classes, and * denotes element-wise multiplication.
The PNC function is based on the L2 least squares error $L_2 = \sum_{j}^{C} \lvert y_{i,j} - \tilde{y}_{i,j} \rvert_2^2$, with $y_{i,j} \in (0, 1)$ the probability of the i-th pixel belonging to the j-th class and $\tilde{y}_{i,j} \in \{0, 1\}$ the ground truth.
The dice loss emphasizes the measure of intersection over union and can be written as (3)
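As a small numerical illustration of the two ingredients named above, the sketch below computes the per-pixel squared error between predicted probabilities and one-hot labels, and a per-class soft Dice coefficient. The Dice expression used here is the common form 2·|A∩B| / (|A| + |B|) and is an assumption; it is not necessarily identical to the expression used in the loss of this paper. Function names and toy values are hypothetical.

```python
import numpy as np


def squared_error_per_pixel(prob, onehot):
    """Sum over classes of (y - y~)^2 for every pixel; inputs are (N, C)."""
    return ((prob - onehot) ** 2).sum(axis=1)


def soft_dice_per_class(prob, onehot, eps=1e-7):
    """Common soft Dice coefficient per class (assumed form, see lead-in)."""
    intersection = (prob * onehot).sum(axis=0)
    total = prob.sum(axis=0) + onehot.sum(axis=0)
    return (2.0 * intersection + eps) / (total + eps)


# Three pixels, two classes: predicted probabilities and one-hot ground truth.
prob = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
onehot = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

print(squared_error_per_pixel(prob, onehot))  # approx. [0.02 0.08 0.72]
print(soft_dice_per_class(prob, onehot))      # approx. [0.667 0.727]
```

The third pixel is misclassified and dominates the squared error, while the Dice term summarizes the overlap per class regardless of how many pixels the class has.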

Iterative median frequency class weights
where $\hat{f}_j^n$ denotes the pixel frequency of the j-th class, i.e. the number of pixels of class j divided by the total number of pixels, at the current n-th iteration, and $f_j^0 = 0$. Then we update the iterative median frequency class weights $fw_j^n$ with a damping factor ϵ of 1e-5. Finally, the pixel-wise adaptive class weights are computed by $w_{i,j} = \frac{fw_j^n}{\sum_{j}^{C} fw_j^n}\,(1 + y_{i,j} + \tilde{y}_{i,j})$.

EXPERIMENTS

Dataset
The data are acquired over the coast of Akerøya (shown in After pre-processing procedures such as orthorectification, image stitching, radiometric calibration quality assessment of the data, removal of personal/sensitive information and metadata, a total of six sub-images are made available to us. The raw RGB dataset is supplied as a single, 4-band (RGBA) GeoTiff with 2.2 cm cell resolution and 8-bit uint data type.
The multispectral dataset has a 9.3 cm cell resolution and values are stored as 32-bit float data type. Due to these differences, a single, consistent dataset containing all the bands of interest is first created, using a 5 cm cell resolution and a bit-depth conversion from 32-bit to 8-bit. The combined dataset has 8 bands (rgb-red, rgb-green, rgb-blue, multispec-red, multispec-green, multispec-blue, multispec-nir and multispec-rededge). We focus on the RGB and NIR bands in this study, but the other bands, such as red edge, will also be used in the future.
There are in total 9 annotated classes. A statistical overview of the images and classes is shown in Tab. 1. There is a very large class imbalance: minority classes such as green algae, red algae and lichen each account for less than 0.5% of the total number of pixels. Based on the statistics, images 1 and 6 are selected as the test set and the rest as the training set. However, as part of the NIR data is missing in image 6, the scores for models trained with NIR data are calculated using image 1 only.
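To illustrate how an imbalance of this kind translates into class weights, the sketch below follows the iterative median frequency weighting described in the Adaptive Class Weighting Loss section. The update rules are assumptions modelled on common median frequency balancing: the running frequency is a cumulative mean of the batch frequencies starting from zero, and the class weight is the median running frequency divided by the class frequency plus the damping factor. Only the final normalization with the factor (1 + y + ỹ) is taken from the text. Function names and the toy label array are hypothetical.

```python
import numpy as np


def update_class_weights(running_freq, labels, n_classes, iteration, eps=1e-5):
    """One batch-wise update; `iteration` starts at 1, running_freq at zeros."""
    counts = np.bincount(labels.ravel(), minlength=n_classes)
    batch_freq = counts / labels.size
    # Assumed cumulative-mean update of the class frequencies.
    running_freq = (batch_freq + (iteration - 1) * running_freq) / iteration
    # Assumed median frequency weights with damping factor eps.
    weights = np.median(running_freq) / (running_freq + eps)
    return running_freq, weights


def pixel_weights(class_weights, prob, onehot):
    """w_ij = fw_j / sum_j(fw_j) * (1 + y_ij + y~_ij); prob, onehot are (N, C)."""
    return class_weights / class_weights.sum() * (1.0 + prob + onehot)


# Toy batch of 100 labelled pixels: 90% class 0, 9% class 1, 1% class 2.
labels = np.array([0] * 90 + [1] * 9 + [2] * 1)
freq, weights = update_class_weights(np.zeros(3), labels, n_classes=3, iteration=1)

print(freq)     # [0.9  0.09 0.01]
print(weights)  # the rarest class receives the largest weight
```

Under these assumptions a class covering 1% of the pixels is weighted roughly ninety times more strongly than one covering 90%, which is how rare classes such as lichen can still contribute meaningful gradient updates.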

Data Set Parameters
Due to the large size of the images and to increase model robustness, random image crop is

Optimizer, Learning Rate and Loss
Adam [Kingma and Ba, 2014] with AMSGrad [Reddi et al., 2018] is used as the optimization algorithm, with the weight decay for non-bias weight parameters set to 2e-5. A multi-step learning rate (LR) scheduler is used with γ = 0.8, steps = 2, and an initial LR of 6e-5. The LR for bias parameters is set to twice that of the non-bias weight parameters. Within each epoch, a polynomial decay $(1 - iter/iter_{max})^{0.9}$ is used to adjust the LR, where $iter_{max}$ is the maximum number of iterations. When the LR becomes smaller than 3.28e-6, a constant LR of 1.8937e-6 is used.
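The learning rate schedule above can be written as a small function. This is a sketch under two assumptions that the text leaves open: "steps = 2" is read as a decay by γ every two epochs, and the maximum number of iterations in the polynomial term is counted per epoch. The function and parameter names are hypothetical.

```python
def learning_rate(epoch, iteration, iters_per_epoch,
                  base_lr=6e-5, gamma=0.8, step=2, power=0.9,
                  floor=3.28e-6, constant_lr=1.8937e-6):
    """LR for a zero-based epoch and a zero-based iteration in that epoch."""
    # Multi-step decay: multiply by gamma every `step` epochs (assumed reading).
    epoch_lr = base_lr * gamma ** (epoch // step)
    # Polynomial decay within the epoch.
    lr = epoch_lr * (1.0 - iteration / iters_per_epoch) ** power
    # Below the threshold, fall back to the constant LR given in the text.
    return constant_lr if lr < floor else lr


def param_group_lrs(lr):
    """Bias parameters use twice the LR of the other weights."""
    return {"weights": lr, "biases": 2.0 * lr}


for epoch, iteration in [(0, 0), (0, 50), (2, 0), (0, 99)]:
    print(epoch, iteration, learning_rate(epoch, iteration, iters_per_epoch=100))
```

Near the end of an epoch the polynomial term drives the LR below the threshold, so the last iterations run at the constant rate.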
The models are implemented with PyTorch and run on a workstation with two NVIDIA GeForce RTX 2080Ti 12GB GPUs.

RESULTS
For inference, a 448×448 tiling window with a stride of 100 is used on the test images (area 1&6). In addition, horizontal and vertical flips are applied to each image patch. The inference results are then reverted to the original orientation before np.argmax is applied along the class axis to obtain the final prediction map. Under the same initial settings, all models are trained for 10 epochs and the last updated model is used for inference. A visual comparison for the test image (area 1) is shown in Fig. 3.
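The tiled inference with flips can be sketched as follows. The text does not state how overlapping tiles or the flipped predictions are merged, so summing the class scores before np.argmax is an assumption, as is aligning the last tile with the image border. `predict_fn` is a hypothetical stand-in for the trained network and `toy_model` is a dummy used only to make the example runnable.

```python
import numpy as np


def tile_starts(length, tile, stride):
    """Tile offsets along one axis; the last tile is aligned with the border."""
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] != length - tile:
        starts.append(length - tile)
    return starts


def predict_tiled(image, predict_fn, n_classes, tile=448, stride=100):
    """image: (H, W, bands); predict_fn maps (h, w, bands) -> (h, w, n_classes)."""
    height, width = image.shape[:2]
    scores = np.zeros((height, width, n_classes), dtype=np.float64)
    for top in tile_starts(height, tile, stride):
        for left in tile_starts(width, tile, stride):
            patch = image[top:top + tile, left:left + tile]
            total = predict_fn(patch)
            # Horizontal flip, prediction flipped back to the original layout.
            total = total + predict_fn(patch[:, ::-1])[:, ::-1]
            # Vertical flip, prediction flipped back to the original layout.
            total = total + predict_fn(patch[::-1])[::-1]
            scores[top:top + tile, left:left + tile] += total
    return np.argmax(scores, axis=-1)


def toy_model(patch):
    """Dummy two-class scorer based on the first band only."""
    band = patch[..., 0]
    return np.stack([band, 1.0 - band], axis=-1)


rng = np.random.default_rng(0)
image = rng.random((500, 460, 4))  # toy 4-band (RGB + NIR) image
prediction = predict_tiled(image, toy_model, n_classes=2)
print(prediction.shape)  # (500, 460)
```

With a stride much smaller than the tile size, most pixels are covered by many tiles, so each final label is effectively a vote over several overlapping and flipped predictions.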

Evaluation Metrics
For quantitative evaluation, standard segmentation metrics based on pixel accuracy and region intersection over union (IU) are used [Long et al., 2015]:
• pixel accuracy: $\sum_i n_{ii} / \sum_i t_i$
• mean accuracy: $(1/n_{cl}) \sum_i n_{ii} / t_i$
• mean IU: $(1/n_{cl}) \sum_i n_{ii} / (t_i + \sum_j n_{ji} - n_{ii})$
• frequency weighted IU: $(\sum_k t_k)^{-1} \sum_i t_i n_{ii} / (t_i + \sum_j n_{ji} - n_{ii})$
where $n_{ij}$ is the number of pixels of class i predicted to be of class j, $n_{cl}$ the total number of classes and $t_i = \sum_j n_{ij}$ the total number of pixels of class i.
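These metrics follow directly from a confusion matrix. The sketch below assumes that every class occurs at least once in the ground truth (otherwise the per-class ratios are undefined); the function name and the 2×2 example matrix are hypothetical.

```python
import numpy as np


def segmentation_metrics(conf):
    """conf[i, j] = number of pixels of class i predicted as class j."""
    conf = np.asarray(conf, dtype=np.float64)
    n_ii = np.diag(conf)            # correctly classified pixels per class
    t_i = conf.sum(axis=1)          # ground-truth pixels per class
    predicted_i = conf.sum(axis=0)  # sum_j n_ji: pixels predicted as class i
    class_acc = n_ii / t_i
    class_iu = n_ii / (t_i + predicted_i - n_ii)
    return {
        "pixel_acc": n_ii.sum() / t_i.sum(),
        "mean_acc": class_acc.mean(),
        "mean_iu": class_iu.mean(),
        "fw_iu": (t_i * class_iu).sum() / t_i.sum(),
        "class_acc": class_acc,
        "class_iu": class_iu,
    }


# Toy example: 10 pixels per class, 3 misclassified in total.
metrics = segmentation_metrics([[8, 2], [1, 9]])
for name, value in metrics.items():
    print(name, value)
```

The same function also yields the per-class accuracy and IU, so the overall and class-wise scores can be computed from a single confusion matrix per model.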

Overall performance
The overall evaluation metrics are summarized in Tab. 3 & 4. They show the scores for RGB-only models (evaluated on area 1&6) and NIR models (evaluated on area 1), respectively. In both tables, the highest score per metric (column-wise) is highlighted in blue. For RGB-only models, DeepLabv3+ outperforms DDCM marginally, but both outperform UNET by at least 6%. However, this margin is significantly reduced in Tab. 4 by the inclusion of NIR, although the best scores are still from the more sophisticated models (DeepLabv3+ and DDCM). Within the same model category, UNET(NIR) outperforms the RGB-only version in all measures, while DeepLabv3+(NIR) underperforms in all measures and DDCM(NIR) shows a mixed result.
Recall that DeepLabv3+ and DDCM use various dilation rates and a spatial pyramid pooling design. The result shows that such a feature pyramid design with varying dilation rates is successful in capturing multi-scale context and helps boost model performance using RGB-only data.

Class-wise performance
For per-class performance, class ACC ($n_{ii}/t_i$) and class IU ($n_{ii}/(t_i + \sum_j n_{ji} - n_{ii})$) are computed and shown in Tab. 5 & 6. Noticeable improvements (>5%) within the same model type are highlighted in gray, and the best score for selected classes is highlighted in blue.
In the ACC scores (Tab. 5), we see that the NIR data bring a large improvement for UNET in classifying green algae, red algae, rock and vegetation, especially red algae, where it has the highest score among all model types. For green algae, lichen and vegetation, however, DeepLabv3+ and DDCM still have the highest scores. The IU scores (Tab. 6) show similar results except for lichen, where UNET without NIR actually has the highest score, though only marginally ahead of DeepLabv3+.
Overall, we find that introducing the NIR channel into training does not bring significant performance improvement for DeepLabv3+ and DDCM, but does make a difference for UNET. We suspect that this is because the NIR channel brings contextual information that UNET needs more than DeepLabv3+ and DDCM. The latter two, by design, have modules that enable them to capture multi-scale context using RGB information alone and to achieve high performance.
RandomCrop (1.0), VerticalFlip (0.5), HorizontalFlip (0.5), RandomRotate90 (0.5)
* Implemented using the Albumentations library [Buslaev et al., 2020]
Figure 3. Example of predictions on a test image.

CONCLUSION
We made a unified re-implementation of three neural network models with distinctive architectures and complexities for general use with UAV multi-sensor data. We tested them on a high-resolution dataset acquired for coastal habitat monitoring. The conventional RGB data and the NIR band from the multispectral sensor are used.
We observe that neural network models with a high capacity for aggregating contextual information are important for achieving satisfactory performance if only conventional RGB data are available. Simply adding data from an extra sensor, NIR in our example, when training existing complex deep neural networks does not guarantee the performance gain one might expect. This performance gain could be expected from baseline models though, namely UNET in this study, especially for vegetation-related classes. In our experiment, the best performance is achieved by the more complex models using RGB-only data, but the gap to the baseline model is much reduced when NIR is included.
Furthermore, we verify that dilated convolutions with multiple rates on high-level features, fused with low-level features, are an effective approach for contextual information aggregation. The use of a customized loss function with adaptive class weighting is also found to be effective in training with highly imbalanced data. We aim to include more UAV multi-sensor data to further investigate these findings in the coming year.