MULTI-ATTENTION GHOSTNET FOR DEFORESTATION DETECTION IN THE AMAZON RAINFOREST

Efficient deforestation detection techniques are essential to monitor and control illegal logging, thus reducing forest loss and carbon emissions in the Amazon rainforest. Recent works based on Deep Learning (DL) models have been proposed for that purpose. DL-based methods, however, are known to require large amounts of training data. Moreover, the deforestation detection application is characterized by a high class imbalance, as recent deforestation areas usually represent a small fraction of the geographic extents being monitored. In order to produce an architecture that is lightweight in terms of the number of learnable parameters and that addresses the high class imbalance of the deforestation detection application, we propose a DL model based on the GhostNet architecture, which combines Ghost modules in a fully convolutional architecture. The proposed architecture also includes Spatial Attention Mechanisms attached to the skip connections of the GhostNet in order to better capture the spatial relationships among class features. Experiments were carried out using Sentinel-2 images of a region in Pará state, Brazil, in the Amazon rainforest. The results show that the proposed model achieves accuracy levels superior to those delivered by state-of-the-art DL architectures, at a lower computational cost due to the smaller number of learnable parameters.


INTRODUCTION
Comprising 5.5 million km², 60% of which is located in Brazilian territory, the Amazon is the largest rainforest and one of the most biodiverse ecosystems on Earth (Lorenz et al., 2021). Unfortunately, in the past decades the biome has experienced sustained threats caused by human intervention. The large-scale use of land for agricultural activities has induced biodiversity loss and degradation, carbon emissions, water pollution and deforestation (Morris, 2010). In particular, deforestation in the Brazilian Legal Amazon (BLA) has increased considerably and alarmingly year after year (Lorenz et al., 2021). According to Pereira et al. (2020), if deforestation reaches around 40% of the total forest area, it can produce an increase in global temperatures of up to 4 °C, and that catastrophic scenario could occur within this century if the current rates persist. Therefore, the implementation of wide-reaching monitoring systems is of paramount importance to understand the nature and effects of the underlying processes, and to efficiently combat illegal logging.
The Brazilian National Institute for Space Research (INPE) has monitored the annual deforestation rates in the BLA since 1988 through the Program for Deforestation Monitoring in the Brazilian Legal Amazon (PRODES) (INPE, 2021). Because the reported figures are official, classification accuracy is extremely important in that application, and the PRODES methodology still involves significant human intervention. Automatic solutions are therefore necessary to reduce both the human effort and the time required to perform the underlying tasks.
In recent years, DL-based methods have become the state of the art in many computer vision fields, including image classification, object detection, and semantic segmentation (Alam et al., 2021). With Deep Neural Networks (DNNs) it is possible to learn robust representations that can improve prediction accuracies (Srivastava and Biswas, 2020). However, to achieve suitable performance, conventional DNNs demand large volumes of training samples, and are characterized by large numbers of parameters and high computational costs (Paoletti et al., 2021).
Indeed, one of the recent trends in DL is the design of high performance DNNs with portable and efficient architectures (Han et al., 2020).
For instance, MobileNet (Howard et al., 2017), ShuffleNet (Zhang et al., 2018) and GhostNet (Han et al., 2020) employ depthwise and pointwise convolutions to replace traditional convolution layers, and drastically reduce the number of weights to be learned during training. Those structures allow creating efficient architectures, with fewer learnable parameters and state-of-the-art performance.
In this work we propose a new Fully Convolutional Network (FCN) architecture, inspired by the GhostNet and containing attention modules. The architecture was evaluated in a deforestation detection task, in a particular region of the Amazon forest. The model was designed to comprise a reduced number of parameters and calculations, as well as to deal with highly class-imbalanced scenarios, as is the case of the target application, since recent deforestation areas usually represent a small fraction of the geographic extents under study. We further compared the outcome of the proposed method with those delivered by alternative architectures, and analyzed the uncertainty in the predictions generated by all the methods.
The remainder of this paper is structured as follows: Section 2 reviews related works about deforestation detection and attention mechanisms. Section 3 introduces the proposed method and the uncertainty metrics used in this work. Section 4 describes the experimental protocol and discusses the obtained results.
Finally, the main conclusions drawn from the experimental analysis are presented in Section 5.

DL-Based Deforestation Detection
Some recently published works have demonstrated promising results for deforestation mapping using FCN architectures. Specifically considering deforestation detection in Amazon sites, Bem and co-authors (de Bem et al., 2020)  Similarly, Ortega et al. (2021) compared three established FCN architectures, namely U-Net, ResU-Net, and a Siamese Network, employed for deforestation detection in the BLA using data from different sensors: Landsat-8, Sentinel-2, and Sentinel-1. In most cases, the ResU-Net architecture delivered the best results and identified the deforested areas with higher precision.
Additionally, a number of similar works employed the U-Net model for deforestation mapping and reported interesting results, e.g., (Wagner et al., 2019, Bragagnolo et al., 2021b, Bragagnolo et al., 2021a). All the previously mentioned works, however, are based on deep CNN architectures characterized by a large number of trainable parameters, and are thus dependent on large sets of labeled training samples in order to avoid over-fitting and perform well, especially considering the class imbalance in the deforestation detection application.

Deep Attention Mechanisms
In the context of DL, attention mechanisms (AM) attempt to mimic the way the human brain processes information (Ghaffarian et al., 2021). One of the key characteristics of human perception is that we tend not to process all available information at once. Indeed, humans have a tendency to selectively focus on one piece of information when and where it is needed, while ignoring other perceptual information (Niu et al., 2021).
AM was proposed by Bahdanau and co-authors in the context of neural machine translation (Bahdanau et al., 2014). Subsequently, the underlying structures and adaptations were employed in other applications, including computer vision, image processing and remote sensing (Koščević et al., 2019, Zeng et al., 2020, Niu et al., 2021). Recent works have demonstrated the potential of AM to improve DL approaches, e.g., (Gao et al., 2020, Qing and Liu, 2021, Xue et al., 2021). The fundamental idea of an AM is to assign different weights to different pieces of information, thus allowing DL models to focus on and identify relevant features for particular tasks (Hu et al., 2018).
Recent literature shows encouraging results of AM employed in remote sensing, and in particular in change detection applications (Chen et al., 2020, Jiang et al., 2020, Lu et al., 2021, Guo et al., 2021). In that context, AM is used to enhance the feature representation of image information, improving discrimination between changed and unchanged regions. Those studies seem to indicate that the inclusion of AM can improve the performance of state-of-the-art DL approaches applied to deforestation detection. Indeed, in (Tovar et al., 2021), a Siamese network with a Spatial Attention Mechanism (SAM) and a Channel Attention Mechanism (CAM) was evaluated for detecting deforestation in a region of the Amazon rainforest. The results showed that this dual AM improved the performance of the network. In addition, the authors reported that the spatial information is more relevant for AM than the channel information. However, as it is a conventional network, it requires a large number of learnable parameters, a cost that more efficient networks can reduce.

METHODOLOGY
In this section, we explain the proposed method, starting with the description of the proposed Multi-attention GhostNet architecture. We then describe the metrics used in the experimental analysis to measure the uncertainty of the method's predictions.

Multi-attention GhostNet
Inspired by the GhostNet architecture, introduced in (Han et al., 2020), we propose a fully convolutional architecture that includes a Spatial Attention Mechanism (SAM), for the deforestation detection task. As we are concerned with detecting changes, the network receives as input two co-registered images acquired at different dates, represented as $I_{T_0}$ and $I_{T_1}$. The images are stacked along the spectral dimension, producing a tensor $I \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ denote the spatial dimensions, and $C$ the number of image channels.
The proposed model follows a symmetric encoder-decoder architecture with skip connections, as can be observed in Figure 1(a). The encoder network is composed of several Ghost blocks with residual mappings. The structure of a Ghost block is illustrated in Figure 1(b): it consists of two stacked Ghost modules, with a conventional convolution (Conv) and a depthwise convolution (DWConv). The structure of a Ghost module is shown in Figure 1(c). This module starts by applying a primary convolution to produce the intrinsic feature maps. Then, a series of cheap linear operations ($\Phi_i$), implemented as a depthwise convolution, are applied to obtain the final ghost feature maps. During this process, an identity mapping is employed to preserve the intrinsic feature maps. The Spatial Attention Mechanisms (SAM) included in the skip connections help to combine low- and high-level feature maps (Section 3.2). The decoder network is composed of a sequence of bilinear up-sampling and convolution operations, and its output is a tensor with the posterior class probabilities for all spatial locations.
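To give a rough sense of why Ghost modules reduce the number of learnable parameters, the sketch below counts the weights of a standard convolution and of a Ghost module with the same input and output channels. It follows the parameter accounting of Han et al. (2020) as we understand it. The channel counts, the kernel size `k`, the ratio `ratio` (outputs per intrinsic map) and the cheap-operation kernel size `d` are illustrative assumptions, not values taken from this paper, and biases are ignored.

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def ghost_module_params(c_in, c_out, k=3, ratio=2, d=3):
    """Weights of a Ghost module (bias ignored).

    ratio and d are illustrative assumptions, not values from the paper.
    """
    if c_out % ratio != 0:
        raise ValueError("c_out must be divisible by ratio in this sketch")
    intrinsic = c_out // ratio                  # maps from the primary convolution
    primary = conv_params(c_in, intrinsic, k)   # ordinary convolution
    cheap = intrinsic * (ratio - 1) * d * d     # one depthwise d x d filter per ghost map
    return primary + cheap


standard = conv_params(64, 128, 3)
ghost = ghost_module_params(64, 128, k=3, ratio=2, d=3)
print(standard, ghost, round(standard / ghost, 2))
```

With these hypothetical settings the Ghost module needs roughly half the weights of the standard convolution, which matches the intuition that the saving grows with the ratio of ghost maps to intrinsic maps.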

Spatial Attention Mechanism (SAM)
The Spatial Attention Mechanism (SAM) (Woo et al., 2018) exploits the inter-spatial relationships of features to produce a spatial attention map. Its structure is presented in Figure 2.
To highlight informative regions, average-pooling and max-pooling operations are first computed and their results concatenated. A convolution and a sigmoid layer follow, resulting in the spatial attention map $M_s$, which is multiplied by the input $F$ to generate the output feature tensor $F'$.
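The steps described above can be sketched in plain Python for a small feature tensor. This is a minimal illustration, assuming that the pooling is performed along the channel axis (as in Woo et al., 2018) and that the convolution uses an odd kernel size with zero padding. The function name `spatial_attention` and the way the kernel is passed in are hypothetical; in a trained network the kernel weights are learned.

```python
import math


def spatial_attention(feat, kernel, bias=0.0):
    """Apply a spatial attention map to feat (nested lists, C x H x W).

    kernel has shape 2 x k x k: one k x k slice for the average-pooled map
    and one for the max-pooled map (k is assumed to be odd).
    """
    n_ch, height, width = len(feat), len(feat[0]), len(feat[0][0])
    k = len(kernel[0])
    pad = k // 2

    # Channel-wise average and max pooling give two H x W maps.
    avg = [[sum(feat[c][y][x] for c in range(n_ch)) / n_ch
            for x in range(width)] for y in range(height)]
    mx = [[max(feat[c][y][x] for c in range(n_ch))
           for x in range(width)] for y in range(height)]
    pooled = [avg, mx]

    # k x k convolution over the two pooled maps (zero padding), then sigmoid.
    attn = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = bias
            for m in range(2):
                for dy in range(k):
                    for dx in range(k):
                        yy, xx = y + dy - pad, x + dx - pad
                        if 0 <= yy < height and 0 <= xx < width:
                            acc += kernel[m][dy][dx] * pooled[m][yy][xx]
            attn[y][x] = 1.0 / (1.0 + math.exp(-acc))

    # The attention map is broadcast over channels: F' = M_s * F.
    out = [[[attn[y][x] * feat[c][y][x] for x in range(width)]
            for y in range(height)] for c in range(n_ch)]
    return attn, out


feat = [[[1.0, 2.0], [3.0, 4.0]],
        [[3.0, 0.0], [1.0, 4.0]]]
zero_kernel = [[[0.0] * 3 for _ in range(3)] for _ in range(2)]
attn, out = spatial_attention(feat, zero_kernel)
print(attn[0][0], out[0][0][1])
```

With an all-zero kernel the sigmoid returns 0.5 at every location, so every feature value is simply halved; learned weights would instead raise or lower the map at specific locations.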

Uncertainty
In this section we present the three metrics used to quantify the predictive uncertainty: predictive variance, predictive entropy, and mutual information. In this work, we employed deep ensemble modeling, which is an effective strategy to estimate the predictive uncertainty of supervised learners (Abdar et al., 2021). We trained several models with the same architecture, but with different random weight initializations and random batch selections.

Predictive Variance
Let $\{y^{(i)}(p)\}_{i=1}^{n}$ denote the set of $n$ different predictions (softmax outputs) at pixel coordinate $p$ for all $K$ classes, and let $y_k^{(i)}(p)$ be the element of the $i$-th prediction $y^{(i)}(p)$ that corresponds to class $k$. The final prediction $\mu_k(p)$ for pixel $p$ and class $k$ is the pixel-wise average over all $n$ predictions $y_k^{(i)}(p)$ (Seeböck et al., 2019):
$$\mu_k(p) = \frac{1}{n} \sum_{i=1}^{n} y_k^{(i)}(p)$$
The variance for each class $k$ is computed by:
$$\sigma_k^2(p) = \frac{1}{n} \sum_{i=1}^{n} \left( y_k^{(i)}(p) - \mu_k(p) \right)^2$$
The final predictive variance for pixel $p$ is obtained by averaging the $K$ class-specific variances:
$$\sigma^2(p) = \frac{1}{K} \sum_{k=1}^{K} \sigma_k^2(p)$$
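As a small worked example of the predictive variance at a single pixel, the sketch below averages the ensemble predictions per class, computes the class-specific variances, and averages them over the classes. The function name and the toy predictions are hypothetical, and dividing by n (rather than n - 1) in the variance is an assumption of this sketch.

```python
def predictive_variance(preds):
    """preds: n softmax vectors (each of length K) for a single pixel."""
    n, k = len(preds), len(preds[0])
    # Final prediction: per-class mean over the n ensemble members.
    mean = [sum(p[c] for p in preds) / n for c in range(k)]
    # Class-specific variance of the members around that mean.
    class_var = [sum((p[c] - mean[c]) ** 2 for p in preds) / n
                 for c in range(k)]
    # Predictive variance: average of the K class-specific variances.
    return mean, sum(class_var) / k


# Three hypothetical ensemble members, two classes.
preds = [[0.9, 0.1], [0.7, 0.3], [0.8, 0.2]]
mean, variance = predictive_variance(preds)
print(mean, variance)
```

If all members return the same softmax vector, the predictive variance is zero; it grows as the members spread apart.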

Predictive Entropy
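Entropy provides a measure of the average level of information or uncertainty inherent to the probable outcomes of a random variable (Shannon, 2001). The predictive entropy at pixel $p$ is computed on the final prediction $\mu_k(p)$, i.e., the average of the ensemble predictions for class $k$, and can be defined as follows:
$$H(p) = -\sum_{k=1}^{K} \mu_k(p) \log \mu_k(p)$$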

Mutual Information
Mutual information measures non-linear relations between two random variables. It expresses how much information can be obtained about one random variable by observing another. The mutual information for a pixel $p$ is the difference between the predictive entropy computed on the final prediction $\mu_k(p)$ and the average of the entropies of the $n$ individual predictions $y_k^{(i)}(p)$:
$$MI(p) = -\sum_{k=1}^{K} \mu_k(p) \log \mu_k(p) + \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_k^{(i)}(p) \log y_k^{(i)}(p)$$
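The two entropy-based metrics can be illustrated for a single pixel as follows. This is a sketch under stated assumptions: the natural logarithm is used, a small `eps` is added inside the logarithm for numerical safety, and the function names and toy inputs are hypothetical.

```python
import math


def entropy(probs, eps=1e-12):
    """Shannon entropy (natural log) of one probability vector."""
    return -sum(q * math.log(q + eps) for q in probs)


def entropy_and_mutual_information(preds):
    """preds: n softmax vectors (each of length K) for a single pixel."""
    n, k = len(preds), len(preds[0])
    mean = [sum(p[c] for p in preds) / n for c in range(k)]
    predictive_entropy = entropy(mean)                      # entropy of the final prediction
    expected_entropy = sum(entropy(p) for p in preds) / n   # mean entropy of the members
    return predictive_entropy, predictive_entropy - expected_entropy


# Members that agree on an ambiguous pixel: high entropy, no mutual information.
h_agree, mi_agree = entropy_and_mutual_information([[0.5, 0.5], [0.5, 0.5]])
# Members that are individually confident but contradict each other.
h_split, mi_split = entropy_and_mutual_information([[1.0, 0.0], [0.0, 1.0]])
print(h_agree, mi_agree, h_split, mi_split)
```

Both toy cases have the same predictive entropy, but only the second has high mutual information, which is why the two metrics are reported separately: mutual information isolates the disagreement among ensemble members.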

EXPERIMENTS
In the following sections we present the design and results of the deforestation detection experiments carried out in this work. We start by describing the dataset used to train and evaluate the proposed method. Next, we detail the experimental setup, and finally, we analyze the effects of including the spatial attention mechanism (SAM) in the network architecture. For that purpose, we compared the accuracy obtained with the proposed Multi-attention GhostNet with that of the fully convolutional GhostNet without SAM. To serve as baselines, we also trained and evaluated two variants of a compact ResU-Net model (with the same number of layers as the implemented GhostNet), with and without SAM.

Study Area
The study area corresponds to a region of the BLA, located in Pará State, Brazil. The site is centered on coordinates 06° 54' 16" South and 055° 11' 52" West (see Figure 3).
Figure 3. Geographical location of the study area.
The dataset comprises two co-registered Sentinel-2 images, downloaded and preprocessed using the Google Earth Engine (GEE) platform (Gorelick et al., 2017). The images were processed to Level-1C, which means they are orthorectified, map-projected and contain top-of-atmosphere reflectance data. The input I to the networks tested here was a stack of image bands of size 9200 × 17730 × 20, as we only considered the bands with 10 m and 20 m spatial resolution (the 20 m bands were resampled using nearest-neighbor interpolation). Moreover, each band was individually normalized to zero mean and unit variance. The dataset is highly imbalanced. According to the PRODES reference, only about 1.13% of the area is associated with the deforestation class, that is, with the deforestation that occurred during the selected period; 61.71% belongs to the no-deforestation class; and 37.16% corresponds to past deforestation, i.e., areas deforested prior to the selected period.

Experimental Setup
The dataset was divided into 20 tiles, each with a size of 2300 × 3546 pixels, maintaining a split of 40%, 10%, and 50% for training, validation, and test, respectively. The network was trained on patches. In all experiments, patches of size 128 × 128 pixels were extracted from the input image, with a stride of 32. Additionally, the following parameter values were used in all experiments: batch size equal to 32; Adam optimizer with learning rate equal to 1e-3 and β equal to 0.9. In order to prevent over-fitting, an early stopping strategy was used.
Considering that the dataset is highly imbalanced, we used the weighted cross entropy as the loss function, with a vector of weights equal to [0.2, 0.8] for the no-deforestation and deforestation classes, respectively. Furthermore, to ensure that all the patches contain samples from both classes, only patches with at least 2% of pixels from the deforestation class were used for training. Data augmentation operations were applied to the training patches: rotation (90°) and flipping (horizontal and vertical).
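To make the patch extraction concrete, the following sketch enumerates the top-left corners of the 128 × 128 patches obtained with a stride of 32 over one 2300 × 3546 tile. It assumes that patches must fit entirely inside the tile, so any remainder at the border is dropped; the paper does not state how borders are handled, and the function name is hypothetical.

```python
def patch_origins(height, width, patch=128, stride=32):
    """Top-left corners of all patches that fit entirely inside the tile."""
    rows = range(0, height - patch + 1, stride)
    cols = range(0, width - patch + 1, stride)
    return [(r, c) for r in rows for c in cols]


origins = patch_origins(2300, 3546)
print(len(origins), origins[-1])
```

Under that assumption each tile yields 68 × 107 = 7276 overlapping candidate patches, before any selection or augmentation of the training patches.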
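The loss weighting and the patch selection rule described above can be sketched as follows. The class weights and the 2% threshold come from the text; the function names, the averaging over the pixel count, and the `eps` term are assumptions of this illustration rather than details confirmed by the paper.

```python
import math

CLASS_WEIGHTS = (0.2, 0.8)  # (no-deforestation, deforestation), as stated in the text


def weighted_cross_entropy(probs, labels, weights=CLASS_WEIGHTS, eps=1e-12):
    """Mean weighted cross entropy over pixels.

    probs:  per-pixel [p_no_deforestation, p_deforestation]
    labels: per-pixel class index (0 or 1)
    Averaging over the number of pixels is an assumption of this sketch.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total += -weights[y] * math.log(p[y] + eps)
    return total / len(labels)


def keep_patch(labels, min_fraction=0.02):
    """True if at least min_fraction of the pixels are deforestation (1)."""
    return sum(1 for y in labels if y == 1) / len(labels) >= min_fraction


# The same error on a deforestation pixel costs four times as much.
loss_missed_deforestation = weighted_cross_entropy([[0.5, 0.5]], [1])
loss_missed_background = weighted_cross_entropy([[0.5, 0.5]], [0])
print(loss_missed_deforestation / loss_missed_background)
print(keep_patch([0] * 98 + [1] * 2), keep_patch([0] * 99 + [1]))
```

For a 128 × 128 patch, the 2% rule corresponds to at least 328 deforestation pixels out of 16384.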

Results and Discussion
In this section, we present and analyze the results obtained using four methods: a standard ResU-Net; a Multi-attention ResU-Net; the fully convolutional GhostNet; and the Multi-attention GhostNet.
The results are summarized quantitatively, in terms of classification accuracy metrics, and qualitatively, through deforestation probability maps and uncertainty maps.
The experimental results in terms of Precision vs. Recall curves, computed on the average prediction map, are shown in Figure 4.
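A precision vs. recall curve of this kind is traced by sweeping a decision threshold over the averaged deforestation probabilities and computing both metrics at each threshold. The sketch below shows the computation for a handful of hypothetical pixels; the `>=` thresholding rule and the convention of returning a precision of 1.0 when no pixel is predicted positive are assumptions of this illustration.

```python
def precision_recall_f1(probs, labels, threshold):
    """Precision, recall and F1 for the positive (deforestation) class."""
    tp = sum(1 for q, y in zip(probs, labels) if q >= threshold and y == 1)
    fp = sum(1 for q, y in zip(probs, labels) if q >= threshold and y == 0)
    fn = sum(1 for q, y in zip(probs, labels) if q < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical averaged probabilities and reference labels for five pixels.
probs = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
curve = [precision_recall_f1(probs, labels, t) for t in (0.3, 0.5, 0.7)]
for point in curve:
    print(point)
```

Raising the threshold trades recall for precision, which is the trade-off the curve visualizes.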
Additionally, Table 3 reports the Recall, Precision, F1-Score and Mean Average Precision (mAP) for the four methods. The curves in Figure 4 show that the four architectures follow a similar tendency, close to the ideal case (upper-right corner), with a slightly better performance for the GhostNet and the Multi-attention GhostNet, which is reflected in the mAP values in Table 3. Table 3 also shows that the GhostNet variants are consistently superior to the ResU-Net ones in terms of Recall, Precision and F1-Score, and that the Multi-attention GhostNet produced the best overall results.
Figure 5 shows the deforestation probability maps produced with the four methods. Those maps also represent the average prediction map obtained from each model. The figure presents three different snips of test tiles. The first two columns on the left show the co-registered pair of images from the two epochs (RGB composition), i.e., T0 and T1. The third column shows the reference mask, in which blue represents no-deforestation, red represents deforestation, and black represents past deforestation. The four columns on the right contain the probability maps produced by each method. In the prediction maps for snips a) and b), one can notice that the Multi-attention GhostNet provided more confident values, as well as better defined polygons. In addition, in snip c) part of the T1 image is covered by clouds. The ResU-Net variants identified some cloud parts as deforestation. The GhostNet variants classified those areas more accurately, demonstrating their superior robustness in that type of scenario.
Finally, we analyze the uncertainty associated with the predictions of the network ensembles in terms of Predictive Variance, Predictive Entropy and Mutual Information. Table 4 presents the uncertainty scores that correspond to the classification of pixel positions with each ensemble, by setting a threshold equal to 0.5 for the averaged output probabilities. The values in the table are averaged uncertainty values computed for pixels that correspond to: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
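The grouping described above can be sketched as follows: each pixel is assigned to TP, TN, FP or FN by thresholding the averaged probability at 0.5, and the uncertainty scores are then averaged within each group. The function name and the toy values are hypothetical, and treating a probability equal to the threshold as a positive prediction is an assumption.

```python
def mean_uncertainty_by_outcome(probs, labels, uncertainty, threshold=0.5):
    """Average a per-pixel uncertainty score within TP, TN, FP and FN."""
    sums = {"TP": 0.0, "TN": 0.0, "FP": 0.0, "FN": 0.0}
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for q, y, u in zip(probs, labels, uncertainty):
        predicted = 1 if q >= threshold else 0
        if predicted == 1:
            key = "TP" if y == 1 else "FP"
        else:
            key = "TN" if y == 0 else "FN"
        sums[key] += u
        counts[key] += 1
    # Groups without pixels have no defined average.
    return {k: (sums[k] / counts[k] if counts[k] else None) for k in sums}


# Hypothetical averaged probabilities, reference labels and uncertainty scores.
scores = mean_uncertainty_by_outcome(
    probs=[0.9, 0.1, 0.6, 0.4],
    labels=[1, 0, 0, 1],
    uncertainty=[0.10, 0.05, 0.50, 0.60],
)
print(scores)
```

The same routine can be applied to any of the three uncertainty metrics, one at a time.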
Although expected, it is interesting to observe that, for all architectures, the uncertainty associated with the correctly classified pixels (TP and TN) is much lower than that of the incorrectly classified ones (FP and FN); that is, the ensembles usually fail when their components disagree to a larger extent. Additionally, among the correctly classified pixels, the lowest uncertainties occur for no-deforestation (TN); in other words, the ensembles are less confident when classifying the deforestation class (TP). Also referring to Table 4: in most cases the architectures with SAM showed lower uncertainties than their counterparts; the GhostNet-based architectures provided lower uncertainties than the ResU-Net variants; and the proposed Multi-attention GhostNet consistently outperformed all other architectures in terms of the uncertainty of its predictions. For a visual interpretation, Figure 6 shows the uncertainty maps obtained for each method over snip c). The plain ResU-Net architecture clearly provided the highest uncertainty values. When SAM is included in the ResU-Net, however, the uncertainty is reduced. Moreover, the GhostNet variants delivered more confident maps, i.e., with lower uncertainty values, the best of which are associated with the proposed Multi-attention GhostNet.

CONCLUSIONS
This work introduced a novel fully convolutional architecture based on the GhostNet, which includes Spatial Attention Mechanisms (SAM). The model was employed in deforestation detection in a particular site in the Amazon region, a problem that is characterized by a high class imbalance.
The experimental results demonstrated that the proposed Multi-attention GhostNet consistently outperformed the baseline approaches: ResU-Net, Multi-attention ResU-Net, and GhostNet without SAM, considering all classification accuracy metrics evaluated. Furthermore, the results also showed that the inclusion of SAM in both ResU-Net and GhostNet led to improvements in classification accuracy.
Additionally, we investigated the uncertainty of the predictions of all methods according to three uncertainty metrics. We found that the inclusion of SAM in both the ResU-Net and GhostNet architectures led to less uncertainty in the predictions. The GhostNet-based architectures were superior to the ResU-Net-based ones in that respect. In conclusion, the Multi-attention GhostNet model produced the most accurate and least uncertain classification maps in the deforestation detection task, at least for the study area considered in this work.
In the future, we plan to enrich the experimental analysis by considering different sites, with varying types of forest, in the Amazon and the Brazilian Cerrado (Savannah) biome. We also want to assess the generalization capacity of the proposed model and possibly use it as a backbone for Domain Adaptation solutions. Finally, we plan to investigate ways to employ uncertainty information to improve semantic segmentation results in change detection tasks.
Figure 6. Uncertainty maps obtained for each method over snip c). The rows correspond to: predictive variance, predictive entropy and mutual information. Blue and red colors represent lower and higher uncertainty scores, respectively; black regions correspond to past deforestation.