DEEP LEARNING FOR AUTOMATIC BUILDING DAMAGE ASSESSMENT: APPLICATION IN POST-DISASTER SCENARIOS USING UAV DATA

During the last few years, the technical and scientific advances in the Geomatics research field have led to the validation of new mapping and surveying strategies, without neglecting already consolidated practices. The use of remote sensing data for damage assessment in post-disaster scenarios underlined, in several contexts and situations, the importance of the Geomatics applied techniques for disaster management operations, and nowadays their reliability and suitability in environmental emergencies is globally recognized. In this paper, the authors present their experiences in the framework of the 2016 earthquake in Central Italy and the 2019 Cyclone Idai in Mozambique. Thanks to the use of image-based survey techniques as the main acquisition methods (UAV photogrammetry), damage assessment analysis has been carried out to assess and map the damages that occurred in Pescara del Tronto village, using DEEP (Digital Engine for Emergency Photo-analysis) a deep learning tool for automatic building footprint segmentation and building damage classification, functional to the rapid production of cartography to be used in emergency response operations. The performed analyses have been presented, and the strengths and weaknesses of the employed methods and techniques have been outlined. In conclusion and based on the authors' experience, some operational suggestions and best practices are provided and future research perspectives within the same research topic are introduced.


Geomatics tools for damage assessment
Disasters that hit communities and places strongly affect our history and culture. Nevertheless, disaster events can be monitored, sometimes predicted, and their effects mitigated using the right tools. Among the different strategies and techniques applied in this specific context, Geomatics has had a prominent role in documenting and archiving spatial 3D information, using image-based and range-based metric survey systems. The ease of deploying UAV platforms and performing aerial photogrammetric surveys has made aerial photogrammetry the best rapid mapping method for producing emergency cartography, supporting the intervention in the field of firefighters and first responders (Calantropio et al., 2018). Different kinds of damage classification grades have been introduced over time. While Copernicus EMS uses five damage classes (Grünthal, 1998), the BAR Methodology uses four classes ("critical visible damage"; "significant visible damage"; "minimal visible damage" and "no visible damage"); the UNOSAT classification uses a binary approach ("damaged"; "not damaged"). An exhaustive analysis of the different damage scales used for building damage assessment by the main satellite-based emergency mapping services has been discussed (Cotrufo et al., 2018), stating that different damage classes and detailed interpretation guidelines with operational examples are crucial for ensuring a suitable analysis of the data. Damages can be categorized based on the observation of the building's morphological features and their spectral features (distribution of brightness, the regularity of the texture, etc.) (Li & Tang, 2018). One method is to use both pre-event and post-event data, which unfortunately are not always available, such as pre-event and post-event TerraSAR-X (TSX) radar images, VHR remote sensing imagery from satellite (including DigitalGlobe's WorldView satellites), UAV and ground-level images.
Building footprints are usually the easiest kind of data to retrieve and are obtainable from different sources, such as pre-event vector maps or cadastral maps. These sources are unfortunately not always up to date; for this reason, when building vector footprints are not available or outdated, it is possible to exploit other kinds of pre-event data, such as VHR satellite image or aerial image that (if relatively recent) allows a manual interpretation of the building's boundaries (Li & Tang, 2018). Some experience (Cotrufo et al., 2018) underlined that it is impossible to directly apply the damage classification scales proposed (Grünthal, 1998) for addressing slight structural damages using remotely sensed images. Even if satellite images can achieve sub-meter resolution, there are still issues in identifying partially damaged buildings; for this reason, data integration from UAV-derived information remains crucial (Li & Tang, 2018). Pre-earthquake condition imagery data can be retrieved from online platforms that use different sources for providing RGB orthophotos (derived from UAV or aerial or satellite photogrammetry). Unfortunately, online data are limited in time and space (i.e., only certain regions are mapped).

WFP experience in Mozambique after cyclone Idai: Artificial intelligence for geospatial data analysis in emergency contexts
In recent years, AI (artificial intelligence) techniques, such as ML (machine learning) and, more specifically, DL (deep learning), have evolved, supporting human activities in different fields.
Apart from examples related to disaster management and damage assessment based on pre-event and post-event data (Nia & Mori, 2017), there are wide-ranging examples of deep learning techniques, such as convolutional networks for biomedical image segmentation (Ronneberger et al., 2015), employed for medical purposes such as cancer detection (Hu et al., 2018; Shen, 2017; Shen et al., 2017) 1 . 2012 was a crucial year for deep learning, following the ImageNet contest results (Krizhevsky et al., 2017), which demonstrated the capability to recognize objects by comparing millions of images without human-codified instructions. When large training sets are available, deep learning constitutes a good alternative to classical algorithms such as Support Vector Machines (Cortes & Vapnik, 1995) or Random Decision Forests (Ho, 1995), which usually perform better on structured data, such as tables and images. After an emergency strikes, the people affected are in the most vulnerable situation, and taking the right decision at the right time is fundamental to minimizing the impact on their lives. The WFP (World Food Programme) has seen that with an appropriate UAV prepositioning workflow (developed during the SEARCH project) 2 , it is possible to collect high-value data in a small amount of time. The most significant limitation of this approach, amplified by the offline environment, is that it takes days to manually analyze the information in a structured and reliable way. WFP, therefore, developed its own open-source solution tailored to the emergency context, where object recognition can enrich and expedite humanitarian decision-making: DEEP (Digital Engine for Emergency Photo-analysis) 3 . After Idai hit Mozambique in June 2019, DEEP was tested in Maputo, Mozambique, with partners from the INGC (Instituto Nacional de Gestão de Calamidades, the National Institute of Disaster Management of Mozambique) and students from Eduardo Mondlane University.
During a ten-day training workshop, ten participants annotated UAV imagery that was later fed into the DEEP algorithm, thereby grasping machine learning basics (Codastefano, 2019). DEEP automated the analysis and processing of high-resolution images, a process that can significantly speed up humanitarian response 4 . Its ease of use and its modular design, based on traditional computing (instead of cloud computing), make it well suited to emergency settings where power is limited and internet connectivity is patchy. The software, designed by TECF, can be installed on any laptop to run a model that can automatically extract objects from drone imagery, classify them as damaged or intact, and plot these on a map (Figure 1). As such, models need to be trained on, and able to use, drone images at various resolutions, depending on resource availability and field conditions. With the correct data, DEEP can also help with diverse tasks such as identifying standing water pools following a cyclone to help monitor potential cholera and malaria outbreaks.
In 2019, Cyclones Idai and Kenneth ravaged large swaths of southern and northern Mozambique, destroying homes and livelihoods and displacing nearly two million people. Working closely with the government and partners in the field, WFP deployed drones during an emergency response for the first time. A few months later, the combined effort of INGC and WFP drones had provided a wealth of valuable data and insight, including maps and aerial imagery. Using these data is a unique opportunity to develop machine learning capacity in a real-world situation.
DEEP is wholly based on open-source dependencies and specifically (but not only) on QGIS 5 and GDAL 6 . During its development, the QGIS guidelines have been followed to create a widely compatible and standardized solution that can provide a high degree of deployability comparable to similar commercial solutions. Because a neural network cannot work efficiently on a large image (such as a UAV-generated orthoimage), the primary requirement and preliminary step of the workflow is to subdivide it into tiles of 1024x1024 pixels each; DEEP exploits the VRT format to preserve the georeferencing of the generated tiles, simplifying the ML task. Another important characteristic of DEEP is that, thanks to GDAL, it can access different libraries for managing the VRT, recompacting files, applying a filter (SIEVE) that ignores segmented geometries with an area lower than a specified number of pixels and, in the end, transforming the raster generated by the segmentation into a GeoJSON file using the "polygonize" option (initially bugged, this option has been corrected in the framework of this research). The original idea of the workflow on which DEEP is based is inspired by the Zanzibar Aerial Mapping Project 7 , adapted with a custom implementation that extracts the buildings identified by the segmentation model (building/not building) and reclassifies them according to the damage level (not damaged/damaged/empty).
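The tiling step described above can be illustrated with a minimal sketch. It only computes the pixel windows that would cover an orthoimage with 1024x1024 tiles; it does not reproduce how DEEP builds its VRT files through GDAL. The function name `tile_windows` and the choice to clip edge tiles (rather than pad them) are assumptions made for illustration.

```python
from typing import Iterator, Tuple

# (x_offset, y_offset, width, height), all in pixels
Window = Tuple[int, int, int, int]


def tile_windows(width: int, height: int, tile: int = 1024) -> Iterator[Window]:
    """Yield pixel windows that cover a raster of the given size.

    Hypothetical helper: edge tiles are clipped to the raster extent,
    so they may be smaller than tile x tile.
    """
    if width <= 0 or height <= 0 or tile <= 0:
        raise ValueError("width, height and tile must be positive")
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            yield (x, y, min(tile, width - x), min(tile, height - y))


if __name__ == "__main__":
    # Illustrative raster size only, not taken from the surveys.
    for window in tile_windows(2500, 1024):
        print(window)
```

Each window keeps its pixel offsets, which is the information needed to place a tile back into the georeferenced frame of the source orthoimage.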
UNET 8 (Ronneberger et al., 2015), based on KERAS 9 , was successfully tested on the Zanzibar dataset and therefore implemented in DEEP. To tackle the emergency generated by Idai, the model was re-trained using images acquired in the field after the disaster. In March 2020, the Open Cities AI Challenge: Segmenting Buildings for Disaster Resilience 10 took place. A study of the challenge results showed that the third-placed solution, provided by Michael Busta 11 and based on (Lin et al., 2016), had suitable characteristics for being included in DEEP (after practical implementations in pre-processing and post-processing), thereby replacing its original embedded ML segmentation model. The damage classification model considers a minimum tile that includes each previously segmented building and classifies it according to the damage. The level of damage is assigned as an attribute of the previously segmented shape of the building footprint.
Because image classification (a whole image is classified and a single label is given as an output) is generally more straightforward than image segmentation (each pixel of the image is classified and a label is assigned to each pixel) (Blaschke, 2003), through negative mining, it is possible to create an "empty" class to detect (using the classification model) the segmented part of the images that were erroneously classified as buildings, to purge them from the final classification output. In this way, the "empty" class allows an optimization of the segmentation output, exploiting the fact that the classification model performs better in this task compared to the segmentation one. DEEP works well with UAV-generated orthoimages at the current optimization and development (from 6 to 9 cm/pix GSD). For this reason, the use of satellite images, which are usually lower in resolutions, does not produce good results.
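The role of the "empty" class can be sketched as a simple post-processing filter. This is an illustrative assumption about the data layout, not DEEP's actual code: footprints are represented as GeoJSON-like dictionaries with an `id`, and `labels` maps each id to the class predicted by the classification model. The helper name `drop_empty` and the attribute name `damage` are hypothetical.

```python
from typing import Dict, List


def drop_empty(features: List[Dict], labels: Dict[int, str]) -> List[Dict]:
    """Discard footprints classified as "empty" and store the damage label
    as an attribute of the remaining ones (hypothetical helper)."""
    kept = []
    for feature in features:
        label = labels[feature["id"]]
        if label == "empty":
            # Segmented item that the classifier does not recognise as a building.
            continue
        properties = dict(feature.get("properties", {}), damage=label)
        kept.append({**feature, "properties": properties})
    return kept


if __name__ == "__main__":
    # Placeholder footprints for illustration only.
    footprints = [{"id": 1, "properties": {}}, {"id": 2, "properties": {}}]
    predicted = {1: "damaged", 2: "empty"}
    print(drop_empty(footprints, predicted))
```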

Politecnico di Torino experience in 2016 Central Italy earthquake
The 2016 earthquake experience in central Italy has been a validation field for many survey and mapping strategies proposed during recent years by the Geomatics research community. Exploiting the last decade's technical and technological advances for emergency response purposes is one of the more crucial topics, intending to obtain and efficiently organize high-scale reliable geospatial data for the early warning, impact, and recovery phases. In this sense, Geomatics applied techniques in DM (disaster management) are primarily devoted to enhancing search and rescue, analysis and assessment, damage monitoring, and emergency management activities. BDA (building damage assessment) operations involve experimental sensors and approaches that work together or are combined with more traditional and consolidated practices. Remote sensing data in BDA is traditionally used through visual interpretation of operators; this is a time-consuming task. The new requests from the FR (first responders) are nowadays oriented towards quasi-real-time data processing and fast production of emergency cartography on different scales. Based on previous similar experiences (Calantropio et al., 2018), UAV flights over damage sites capturing high-resolution video footage and photos for obtaining 3D models, and orthoimages using SfM techniques are nowadays consolidated methodologies for obtaining rapid and reliable information of the areas affected by a disaster. In this scenario, different tests have been carried out during the post-earthquake documentation activities conducted by the Politecnico di Torino Geomatics group in cooperation with the SAPR team of the Italian Fire Fighters and the GEER association 12 (Geotechnical Extreme Events Reconnaissance association). Politecnico di Torino is an Italian technical university that has contributed to this direction, applying knowledge and resources to the study of calamities and disasters and their effect on the natural and built environment. 
Especially after the seismic swarms that hit Italy in August 2016, a task force of volunteers among professors and researchers was created to document and study each phase of the event and its effects, and to guide the necessary interventions. Several UAVs and sensors were employed to define standards and guidelines for data acquisition and processing according to the different platforms, payloads, imaging sensors, and required accuracy of the final output. Summarising the obtained results and according to the achieved final products, it is possible to define some essential items that need to be considered to get suitable outputs for the BDA operations. Concerning the image-acquisition phase, it is recognized that, in addition to the nadir acquisitions, the use of oblique images (Aicardi et al., 2016; Duarte et al., 2017; Ezequiel et al., 2014) is crucial for improving the 3D reconstruction (especially in urban areas). It is also essential during the BBA (bundle block adjustment), since those images increase the rigidity of the photogrammetric block and, consequently, make it possible to decrease the number of GCPs (ground control points) used when platforms without RTK/PPK are employed (Nesbit & Hugenholtz, 2019; Ostrowski & Bakuła, 2016). Another important aspect is related to GCP acquisition and accuracy evaluation. If, on the one hand, it is crucial to obtain data very quickly during an emergency, on the other hand, it is essential to verify the final accuracy of the achieved output. To meet these objectives, using a minimum number (4-5) of well-measured points in the field is an essential requirement for acquisitions performed using platforms without RTK/PPK capability.
Following this practice could be a plus for the platforms with accurate onboard GNSS since the use of the surveyed data is necessary for evaluating the final accuracy (using them as checkpoints) or to improve the quality of the camera calibration (Gabrlik et al., 2018). Finally, concerning the data-processing and apart from the possibility of having in a short time reliable and high-resolution 3D models useful for BDA activities, the possibility of combining multi-temporal acquisitions in a typical project (Aicardi et al., 2016) allows the monitoring of the affected area across time accurately. Following the strategy mentioned above, the experiences carried out during the 2016 Earthquakes were deployed mainly using COTS (commercial off-the-shelf) multi-rotor and fixed-wing platforms. In September, October, November 2016, and February 2017, the acquisitions were performed over different villages with multi-sensors documentation of buildings and urban areas heavily and repeatedly damaged by the numerous seismic events. Several villages have been surveyed, such as Pescia, Pescara del Tronto, Cittareale, Accumoli, Norcia, Castelluccio, Amatrice, Campi di Norcia, etc. Multi-scale documentation has been achieved to allow the different researchers involved in the emergency activities to enrich the knowledge of the areas affected by the events.
In the next chapter of this paper, the first results of the use of DEEP are described and analyzed, starting from the achieved orthophoto to the performed building damage assessment.

APPLICATION OF DEEP ON THE 2016 CENTRAL ITALY EARTHQUAKE DATASET
Building on the experience gained by the WFP in the development of DEEP and on the expertise of the Geomatics group of the Politecnico di Torino, in 2020 a joint research project focused on testing the possibilities offered by the proposed deep learning approach on the data collected during the 2016 post-earthquake surveys. The first tests, related to the village of Pescara del Tronto (Figure 2), which was seriously affected by the different shocks, are reported in detail in the following sections. The primary objective of these tests is to develop an entirely new automatic methodology for damage assessment, not affected by the subjective evaluation that could be found in the Copernicus EMS classification.

Segmentation model
The first test concerned the segmentation model of DEEP, which can be used (in addition to the existing cartography) to automatically detect the building footprints that will later be classified by the classification module. In the preliminary test, the building segmentation model performed as expected on the Pescara del Tronto orthophoto (Figure 3). Therefore, there was no need to perform any additional adjustments to the model. The preliminary evaluation of the results showed that a small number of non-building items (roads, parking lots, etc.) were segmented (recognized as buildings). To mitigate this issue, a "sieve" filter has been applied before generating the final building footprint vector, excluding all the items with an area lower than 5 m². It is essential, however, to state that the segmentation model is valid only when the building footprint is still clearly recognizable; for this reason, in the case of destroyed buildings, where even a trained operator would have difficulties in recognizing the original building footprint, it is necessary to employ an existing layer of cartography that can provide the building footprint. Therefore, the segmentation model will be used to detect new buildings or additions built after the ones represented in the existing cartography, to provide a comprehensive and updated cartographic base for the following BDA. The output of this step is a GeoJSON file containing the shape of each building with a unique ID and a closed geometry.
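Since the sieve filter operates on pixel counts while the threshold above is expressed in square metres, a conversion through the GSD is needed. The following is a minimal sketch of that arithmetic; the function name `sieve_threshold_px` is hypothetical, and the 9 cm/pix value used in the example is an assumption taken from the GSD range quoted for DEEP, not the documented GSD of the Pescara del Tronto orthophoto.

```python
import math


def sieve_threshold_px(min_area_m2: float, gsd_m: float) -> int:
    """Convert a minimum ground area (m^2) into a pixel count for a raster
    with the given GSD (metres per pixel). Hypothetical helper."""
    if min_area_m2 < 0 or gsd_m <= 0:
        raise ValueError("area must be non-negative and GSD must be positive")
    # One pixel covers gsd_m * gsd_m square metres on the ground.
    return math.ceil(min_area_m2 / (gsd_m ** 2))


if __name__ == "__main__":
    # Assumed GSD of 0.09 m/pix: 5 m^2 corresponds to about 618 pixels.
    print(sieve_threshold_px(5.0, 0.09))
```

The same threshold in square metres therefore maps to a different pixel count for every orthophoto, which is why the GSD has to be known before the filter is configured.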

Classification model
The second step was using the image classification model, previously trained with Mozambique and Zanzibar open data, to run a preliminary test using the Keras implementation of the VGG16 classification algorithm (Simonyan & Zisserman, 2015) on the orthophoto of Pescara del Tronto (Figure 3), generated from UAV images gathered in August 2016 (after the first seismic swarm). The results are shown in Figure 4. As can be observed, the results are inconsistent and do not resemble the actual damage classification provided by Copernicus, sometimes providing random outputs. These results were expected because the typology of debris produced by the catastrophic event in Mozambique is substantially different from that in central Italy. It is essential to clarify that the challenge of applying this system and method to different cities and countries is related to the fact that different materials and construction techniques generate different debris patterns; therefore, the features recognized by the DEEP classification model are not comparable at this stage.
With the aim of improving the results obtained in this test, the classification layers of the CNN (convolutional neural network) of the image classification model have been re-trained using data samples obtained from the UAV surveys performed in three other villages affected by the same seismic swarm in Central Italy (Figure 5). Thus, the additional samples were selected from the orthophotos of the cities of Amatrice, Accumoli, and Norcia (August 2016). Moreover, the preliminary test showed the adopted solution was still not optimal, as some areas of the orthophoto were again erroneously segmented as buildings. This problem highlighted the need for more advanced training with three classes: "not damaged", "damaged", and "empty". The additional class "empty" can later be used to filter out the erroneously segmented items, eliminating them before the BDA operation.
Figure 5. The image shows the neural network blocks that have been re-trained in two steps: at first, the classification layers (block 6); then the classification layers plus the last block of the feature extraction part (block 6 + block 5).

Training and validation: the dataset
Even though the initial data were already labelled and divided into five damage classes according to EMS-98 (Grünthal, 1998), the different classes were highly unbalanced. Moreover, Copernicus's damage classification was performed on aerial data (with a lower resolution than the UAV orthophotos used in this research), and it was subject to human error, as it was an operator-dependent classification. For these reasons, the data have been divided into two classes of damage, using a binary classification of "not damaged" (class 0) and "damaged" (class 1). As previously mentioned, a class "empty" containing negative examples has been introduced.
The dataset was therefore divided as follows:
- 80% training set;
- 20% validation set.
The data augmentation parameters used for the training phase are reported in the appendix at the end of the article. With a batch size of 8, there were 660 steps per epoch.
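The batch size and the number of steps per epoch give a rough indication of the size of the training set. The sketch below assumes the common convention that one epoch visits every training sample once, so that steps = ceil(samples / batch size); the paper does not state this explicitly, and with data augmentation the number of steps may have been set differently. The function names are hypothetical.

```python
import math
from typing import Tuple


def steps_per_epoch(n_samples: int, batch_size: int) -> int:
    """Steps needed to visit every sample once (last batch may be partial)."""
    return math.ceil(n_samples / batch_size)


def sample_range(steps: int, batch_size: int) -> Tuple[int, int]:
    """Smallest and largest sample counts consistent with a given number of steps."""
    return ((steps - 1) * batch_size + 1, steps * batch_size)


if __name__ == "__main__":
    low, high = sample_range(660, 8)
    # Under the stated assumption, the training set holds between
    # 5273 and 5280 samples.
    print(low, high)
```

Under the same assumption, and if the 660 steps refer to the 80% training portion, the complete dataset would contain roughly 6600 samples.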

Training and validation: the results
Step A: At first, only a re-training of the classification layer has been performed. Since the model was already pre-trained on the Mozambique data, there was no need for a long training process. It was interrupted after epoch 4, when validation accuracy improved from 0.91 to 0.95. The details of this step are reported in Table 1 and Table 2.
Table 2. Main classification metrics after the first training (epoch 4; validation accuracy improved from 0.91 to 0.95; validation loss: 0.14). The table shows a summary of the precision, recall, and F1 score for each class.
Step B: In this step, a fine-tuning of the whole neural network has been performed, re-training not only the last layer as done in the first step but also the last feature extraction block. The process was interrupted after epoch 2 -validation accuracy improved from 0.89 to 0.97. The details of this step are reported in Table 3 and

Binary shades for multi classes classification
After the training and the validation phases, the classification model has been tested again on the Pescara del Tronto orthophoto. It is essential to clarify that the models have never seen Pescara del Tronto's data during the training and validation phases.
After the segmentation, the buildings provided in the existing cartography were subtracted from the buildings detected by DEEP, generating a layer containing only buildings not present in the current cartography. The 3-class classification model ("not damaged"; "damaged"; and "empty") has been applied only to this last layer, as it might contain wrongly segmented items (erroneously segmented as buildings). The items classified as "empty" were therefore discarded. The buildings extracted from the existing cartography were instead classified using the 2-class classification model ("not damaged" and "damaged"), because they were certain to contain only buildings.
To further improve the quality of the final output, and to make it comparable with other damage classification standards (such as EMS-98), instead of assigning to each building a value of "not damaged" or "damaged", the classification likelihood score (from 0 to 1) of belonging to one of the two classes was considered. This value has been assumed to be a function of the level of damage for each of the analyzed buildings. The whole workflow is summarized in Figure 6, and the classification output is reported in Figure 7. All the steps detailed in this chapter were undertaken to refine the results initially obtained in Figure 4. The output is a heatmap of the damages that exploits the uncertainty of the classification model, using the probability of a building being "not damaged" or "damaged" and assigning a score from 0 to 1 (from white to red). However, it is essential to check how the model performed compared to the ground truth data (obtained by Copernicus EMS).
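The white-to-red rendering of the likelihood score can be sketched as a linear colour ramp. This is an assumption about how such a heatmap could be styled, not the symbology actually used for Figure 7; the function name `damage_colour` is hypothetical, and the score is taken to be the model's probability for the "damaged" class.

```python
from typing import Tuple


def damage_colour(score: float) -> Tuple[int, int, int]:
    """Map a likelihood score in [0, 1] to an RGB colour on a linear
    white (0.0) to red (1.0) ramp. Hypothetical helper."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    # Red stays saturated; green and blue fade out as the score grows.
    fade = round(255 * (1.0 - score))
    return (255, fade, fade)


if __name__ == "__main__":
    for s in (0.0, 0.25, 1.0):
        print(s, damage_colour(s))
```

A continuous ramp of this kind keeps intermediate scores visible on the map instead of forcing each building into one of the two classes.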

Copernicus verification and comparison
At the current moment, it is not easy to define the intervals of DEEP's scores that correspond to each of the classes of the EMS-98 classification adopted by Copernicus. For this reason, a confusion matrix of the actual and predicted levels of damage cannot be provided. To give an idea of the method's performance in this specific case, it is possible to provide a qualitative comparison between the likelihood score assigned by DEEP (0-1, white-red color ramp) and the damage level set by Copernicus (Figure 8). At this stage, assigning an interval of the score to each class appears possible, but it will require further tests involving a larger amount of ground-truth data that can be used for a suitable calibration of the score conversion. It will also be necessary to re-create a ground-truth dataset, as Copernicus's classification is not only subject to bias because of its operator-related nature but was also performed on aerial images with a coarser GSD, which makes it difficult to assign the right level of damage unequivocally. Moreover, the next step of the implementation will be to use three levels of damage ("not damaged"; "damaged"; "destroyed"; plus the "empty" class), because it is relatively easy to distinguish between "damaged" and "destroyed" buildings, whereas the thematic accuracy of the intermediate classes of the EMS-98 is not sufficient due to the nadir point of view of the orthophotos (Cotrufo et al., 2018). This additional step will, of course, require a re-training and a new fine-tuning of DEEP with appropriately labelled samples.

Figure 8.
For each of the classified buildings, a box plot diagram shows the damage class assigned by Copernicus (on the x-axis) and the relative likelihood score given by DEEP (on the y-axis) for each of the items (buildings). The whiskers indicate the variability outside the first and last quartiles; the X symbol indicates the average value for a given class.

CONCLUSIONS
During the development of the tests related to this research, it has been observed that DEEP works well with orthoimages that have a GSD of 9 cm ± 3 cm; this is a good compromise among the quality of the orthoimages, the number of images required for their generation, flight altitude, and time of flight. For example, a DJI Phantom 4 RTK can obtain a 9 cm/pix GSD with a flight altitude of about 330 m, while the SenseFly eBeeX can get a 9 cm/pix GSD with a flight altitude of approximately 390 m. It is important to note that the optimal resolution in real case scenarios is a function of the flight altitude (and not vice versa), because in the early aftermath of the emergency the drone might have to fly while other aircraft (such as helicopters) operate simultaneously to assist with the ground operations and logistics. It has also been observed that higher resolutions (lower GSD) introduce false positive detections during the segmentation phase and increase the overall processing time; lower resolutions (higher GSD) instead prevent the classification model from working correctly, making it difficult or impossible to detect the level of damage accurately. The proposed method can deliver a thematic map for a very early assessment of the situation in less than one hour after arrival on site (10-15 min flight time, 5 min data transfer, 20 min orthophoto generation, 10 min DEEP classification). The DEEP segmentation model is helpful for updating the existing cartography when the retrieved building shapes are out of date.
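The relation between GSD and flight altitude quoted above follows from the standard nadir-image relation GSD = H × pixel pitch / focal length. The sketch below inverts it to obtain the altitude for a target GSD. The camera parameters used in the example (13.2 mm sensor width, 5472 pixel image width, 8.8 mm focal length) are assumed nominal values for the DJI Phantom 4 RTK and should be checked against the manufacturer's data sheet; the function name `altitude_for_gsd` is hypothetical.

```python
def altitude_for_gsd(gsd_m: float, focal_mm: float,
                     sensor_width_mm: float, image_width_px: int) -> float:
    """Flight altitude (m above ground) that yields the target GSD (m/pixel)
    for a nadir image, from GSD = altitude * pixel_pitch / focal_length."""
    if min(gsd_m, focal_mm, sensor_width_mm, image_width_px) <= 0:
        raise ValueError("all parameters must be positive")
    pixel_pitch_mm = sensor_width_mm / image_width_px
    return gsd_m * focal_mm / pixel_pitch_mm


if __name__ == "__main__":
    # Assumed nominal camera parameters (not taken from the paper).
    altitude = altitude_for_gsd(0.09, focal_mm=8.8,
                                sensor_width_mm=13.2, image_width_px=5472)
    print(f"{altitude:.0f} m")  # about 328 m, consistent with "about 330 m"
```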
The next phase of the work will be devoted to investigating the potential of this tool in real case applications, thanks to an ongoing collaboration with the Italian National Fire Corps (Corpo Nazionale dei Vigili del Fuoco), and will proceed in two steps. The first step is the creation of a comprehensive dataset of orthophotos acquired in emergencies that can provide valuable material for the optimization of DEEP, to improve its performance in the future. The second step concerns the automation of the processing part (reading the orthophoto, auto-detecting the GSD, setting a suitable value for the sieve filter, and so forth) to enhance the deployability of the algorithm in the field. The principal errors are related to the fact that the training set does not contain a sufficient number of terraces or other kinds of roofs, and to some dubious situations where even a human operator would have difficulty assigning a damage level. As of the publication date of this paper, this is the very first research presented about the DEEP network (which has not yet been publicly released, but soon will be, and will be freely accessible and usable). Future research work will focus on implementing an instance segmentation model that will replace the current segmentation model (due to some issues reported in the case of buildings too close to each other being recognized as a single entity). Moreover, the possibility of performing a visual interpretation of more orthoimages will provide an up-to-date and reliable ground truth that will allow a more comprehensive validation of the employed method.