INTELLIGENT 3D CRACK RECONSTRUCTION USING CLOSE RANGE PHOTOGRAMMETRY IMAGERY

Structural Health Monitoring (SHM) of civil infrastructure and its preservation from deterioration are crucial tasks. In general, natural disasters like severe earthquakes, extreme landslides, subsidence or intensive floods directly influence the health of civil structures such as buildings, bridges, roads, and dams. Evaluation and inspection of defects and damage in these structures help to preserve them from destruction by accelerating rehabilitation and reconstruction. Because of the large number of structures, an automatic and precise crack detection framework is required for periodic assessment and inspection. In this study, a two-step procedure for crack segmentation and 3D reconstruction is proposed. The crack segmentation is carried out using the Deeplabv3+ architecture with Xception as the backbone. Next, Squeeze-and-Excitation is added as an attention module to achieve higher accuracy. The predicted masks and the original images are then integrated into a structure-from-motion procedure. In the last step, ground control points and scale bars are used to overcome the datum rank deficiency in absolute orientation through the bundle adjustment procedure in aerial triangulation. The most probable segmented cracks are overlaid on the 3D point clouds in the global coordinate system with true scale. Our network is trained on 8000 images and their corresponding masks, reaching an Intersection over Union (IoU) of 69%. Sub-millimetre accuracy of crack reconstruction using the proposed methodology is validated with a scale bar.


INTRODUCTION
Among the 17460 in-service bridges in New York between 1992 and 2014, 98 collapsed. Of these collapses, 46% were due to structural deficiency and 5% were associated with loss of life. The most influential factor was the age of the structure. Therefore, it was suggested that bridges older than 52 years be inspected and assessed regularly and precisely (Cook and Barr, 2017). The United States spends more than 200 billion dollars each year on the maintenance of facilities and public civil infrastructure. One third of the approximately 576,600 American bridges are structurally deficient or functionally obsolete and may require repair or replacement. As infrastructure ages and its maintenance and repair needs grow, the related costs increase sharply. Performing Structural Health Monitoring (SHM) regularly enables the condition assessment of structures and the prediction of their remaining lifetime. On the one hand, SHM ensures the safety of the structures, and on the other hand, it reduces maintenance cost (Giurgiutiu, 2014). Moreover, it can extend the life span through timely deficiency detection and rehabilitation. Furthermore, it provides additional information to be used in the design, operation and management of civil infrastructure (Vardanega et al., 2022).
Cracks are the most common signatures of structural deficiency. They typically occur due to overloading, drying shrinkage, thermal contraction/expansion, and structural deterioration. These structural damages should be sealed to ensure long-term durability and to prevent future cracking (Al-Mahaidi and Kalfat, 2018). Having prior information about the severity and extent of deficiencies accelerates rehabilitation. This investigation can be conducted visually, semi-automatically or automatically. Visual evaluation by human experts is the simplest method, but it has high labour costs and yields unreliable, inspector-dependent results. Human responses to a crack differ and depend on the person's skills and knowledge. One expert may consider a feature a nuisance while another may consider it a fine crack, or vice versa. Therefore, making a decision based on human vision and skills is a time-consuming task with limited reliability (Van Grieken, 2008).
In addition, the accuracy of crack assessment is influenced by the choice of sensors and of the detection and reconstruction methods. With improvements in robotics and the development of low-cost sensors such as digital cameras, most attention has turned to automatic methods (Zakeri et al., 2017). In other words, the advent of robot-based inspection, e.g., with drones, makes SHM easier and opens a new era of technological activity (Ozer and Feng, 2020). Although SHM has many advantages, it also has limitations that affect the reliability and accuracy of data collection. Besides the age of structures, differences in their shape and size are other influencing factors. These variations in structure create the need for comprehensive, structure-specific monitoring techniques. However, a standard SHM technique should save time and cost, reduce data collection errors, and be accurate when inspecting any kind of structure (Alokita et al., 2019).
Segmentation is a fundamental task in photogrammetry and computer vision which enables the detection of objects in images based on the homogeneity of their pixels. It has many applications in autonomous driving, remote sensing, and medical image analysis (Hamishebahar et al., 2022; Hsieh and Tsai Yichang, 2020). Zakeri et al. (2017) performed comprehensive research on traditional image-based techniques for crack segmentation, detection, and classification. There are many traditional methods based on morphological operations (Landstrom and Thurley, 2012; Maode et al., 2007; Tanaka and Uematsu, 1998), edge detection (Abdel-Qader et al., 2003), thresholding (Akagic et al., 2018; Oliveira and Correia, 2009), and textural filters (Hu et al., 2010; Salman et al., 2013), which are obsolete due to their limited accuracy and occasional inability to segment cracks. Some of the drawbacks of the traditional techniques are described in the following:
• They are not general and objective methods. These methods are scene or image dependent, and their results vary from image to image (Zakeri et al., 2017). One example is parameter determination in thresholding or morphological methods, which depends on image conditions.
• Crack pixels are a minor part of an image in comparison to the background. The low ratio of crack pixels to background pixels is another problem that traditional methods have to deal with. These approaches fail in complex backgrounds due to the aforementioned parameter dependence.
• Texture-based and threshold-based techniques fail when cracks are embedded in surfaces with similar features such as wires or tile boundaries. In addition, they fail to identify crack pixels when shadows of tree branches appear on the road (Salman et al., 2013).
• It is difficult to segment and quantify cracks, since cracks do not propagate in deterministic orientations. This is due to the random nature of cracks and their various possible orientations. Zhang et al. (2013) proposed an algorithm to detect cracks aligned at 12 different orientations based on a matched filter. Medina et al. (2010) assumed that cracks are aligned parallel or orthogonal to the road or pavement.
• Traditional methods are sensitive to image degradation such as noise or low light conditions.
• Oliveira and Correia (2014) and Tang and Gu (2013) assume that a crack is a thin object, which might not be the case. Scale-invariant methods are required that are independent of the crack's width, length, and depth. Therefore, methods that assume a crack is a thin object are less general and might not be suitable.
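The parameter dependence and the small share of crack pixels noted in the list above can be illustrated with a small synthetic experiment. The sketch below is not taken from any of the cited methods: the generator `make_surface`, the brightness values, and the threshold of 150 are assumptions chosen purely for illustration. A global threshold tuned for a well-lit surface isolates the crack, while the same threshold labels every pixel of a shaded surface as crack.

```python
import numpy as np

def make_surface(brightness, crack_depth=50.0, size=32, seed=0):
    """Synthetic grey surface with a one-pixel-wide horizontal crack (hypothetical data)."""
    rng = np.random.default_rng(seed)
    img = rng.normal(brightness, 2.0, (size, size))
    img[size // 2, :] -= crack_depth  # crack pixels are darker than the background
    return np.clip(img, 0.0, 255.0)

def threshold_segment(img, t):
    """Global threshold: every pixel darker than t is labelled as crack."""
    return img < t

t = 150.0  # threshold tuned by hand for the well-lit surface only
well_lit = make_surface(brightness=180.0)
shaded = make_surface(brightness=100.0)

mask_lit = threshold_segment(well_lit, t)
mask_shaded = threshold_segment(shaded, t)
crack_fraction = mask_lit.mean()  # share of crack pixels in the well-lit image

print("crack pixels, well-lit image:", int(mask_lit.sum()))
print("crack pixels, shaded image:  ", int(mask_shaded.sum()))
print("crack share of the image:    ", crack_fraction)
```

In this toy case the crack covers only about 3% of the image, and the hand-tuned threshold does not transfer between the two lighting conditions.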
The next generation of techniques is based on machine learning. Support Vector Machine (SVM) (Li et al., 2009; Prasanna et al., 2016) and Random Forest (Shi et al., 2016) are the most popular and efficient algorithms. The main problem of these machine learning approaches is that they rely on shallow learning and cannot deal with complex or dark scenes (Hsieh and Tsai Yichang, 2020). The state-of-the-art approaches are described in detail in Wang et al. (2022). More specifically, Hamishebahar et al. (2022) and Hsieh and Tsai Yichang (2020) addressed deep learning methods for crack identification. Hamishebahar et al. (2022) divided the task into image classification, object recognition, and segmentation. Hsieh and Tsai Yichang (2020) explained machine learning- and deep learning-based methods and reviewed public data sets and evaluation metrics. Ronneberger et al. (2015) proposed U-net, a segmentation method for medical image analysis based on an encoder-decoder architecture and convolutional neural networks (CNN). It proved useful in both medical and non-medical image segmentation applications. Many crack segmentation techniques were developed based on U-net (Jenkins et al., 2018; Liu et al., 2019; Qiao et al., 2021). Fu et al. (2021) used Deeplabv3+ (Chen et al., 2018) and Xception (Tang and Gu, 2013) for bridge crack segmentation. Zhang et al. (2020) added Squeeze-and-Excitation (Hu et al., 2018) and improved the accuracy while adding only a few parameters to the network. Zhao et al. (2022) proposed another architecture, the so-called Crack-FPN, exclusively for crack segmentation and reported 86% IoU. They trained their network on only 717 images and tested it on 79 images, and compared the results with an improved U-net.
In addition to segmentation, the 3D reconstruction of cracks is of great importance. 3D reconstruction is a 3D representation of an object whose points can be obtained by photogrammetric methods or laser scanners. In computer vision, many methods operate on multiple images instead of stereo image pairs (Hartley and Zisserman, 2003; Stereopsis, 2010). These multi-image methods decrease occluded areas and support dense matching, dense point clouds, and better texture mapping. Ma and Liu (2018) addressed the main steps of 3D reconstruction and their corresponding references, such as feature extraction and matching, bundle adjustment, and 3D dense reconstruction. Then, they investigated point cloud processing approaches and applications of 3D reconstruction in civil engineering infrastructure such as buildings, roads, and bridges. They also performed crack assessment, deformation assessment, and bridge disease detection. Liu et al. (2016) proposed an image processing step to extract crack features before 3D reconstruction and projected crack pixels with a pinhole camera model. They claimed that the main challenges in crack quantification with image-based techniques are absolute scale and camera orientation. They reconstructed point clouds to obtain the working distance for accurate crack assessment. Shokri et al. (2020) compared two common segmentation networks, U-net (Ronneberger et al., 2015) and Seg-net (Badrinarayanan et al., 2017), with different loss functions. Next, they chose the network with the best result and reconstructed the cracks. They also considered the condition that the crack is distributed along a plane and performed a one-step refinement in a bundle adjustment. Xue et al. (2022) proposed a Structure from Motion (SfM) based deep learning method to quantify defects in tunnels. First, they segmented images with Mask R-CNN (He et al., 2017) and then reconstructed the 3D scene with SfM and CMVS. Next, they used the images segmented with Mask R-CNN for texture mapping and visualisation. The scale parameter was obtained from known scale bars along the tunnel route. Finally, 3D segmented point clouds with true dimensions were obtained.
In this research, we develop and implement a deep learning-based method for crack detection and 3D reconstruction which overcomes the deficiencies of the traditional methods. The strength of our work lies in the integration of the Xception and Squeeze-and-Excitation modules as well as in training the network on a large data set, which increases its reliability. We used Deeplabv3+ with Xception as a feature extractor and Squeeze-and-Excitation to boost the result of crack segmentation. The crack pixels segmented in the captured images are then input to the SfM algorithm for 3D reconstruction. To specify the position of the cracks in 3D space with their true scale, Ground Control Points (GCPs) are measured, which enables the estimation of the similarity transformation parameters: one scale, three orientation, and three translation parameters. In this research, low-cost smartphones are used instead of metric digital cameras or Terrestrial Laser Scanners (TLS). The framework of our proposed approach is illustrated in Figure 1.
Figure 1. The framework of our proposed algorithm.

METHODOLOGY

2.1 Segmentation
To develop the framework in this research, Deeplabv3+ (Chen et al., 2018), a deep learning encoder-decoder architecture, is used. The Xception network (Tang and Gu, 2013) is integrated as the backbone to extract features. In addition, a Squeeze-and-Excitation module (Hu et al., 2018) is included to add channel-wise attention to our network. Figure 3 shows the proposed network architecture, with Xception in the Deep Convolutional Neural Network section and the Squeeze-and-Excitation block after the Atrous Spatial Pyramid Pooling (ASPP) step. We supplied images and their corresponding masks and trained our network. A combination of the network prediction with the original image was then passed to the 3D reconstruction. The segmentation is performed in the following steps:
1. add images and their corresponding masks,
2. perform image preprocessing,
3. train and store the model,
4. patchify the real image and perform prediction on the patches,
5. unpatchify the predicted masks,
6. locate the mask pixels on the original image.
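The channel-wise attention provided by a Squeeze-and-Excitation block can be sketched in a few lines. The example below is a plain NumPy forward pass under stated assumptions, not the trained module of this work: the feature map size, the reduction ratio `r = 4`, and the random weights are hypothetical. It shows the three operations of the block: squeeze (global average pooling per channel), excitation (two small fully connected layers ending in a sigmoid), and rescaling of each channel by its gate.

```python
import numpy as np

def squeeze_excite(x, w1, b1, w2, b2):
    """Forward pass of a Squeeze-and-Excitation block on one feature map.

    x  : (H, W, C) feature map
    w1 : (C, C // r) weights of the bottleneck layer, b1 : (C // r,)
    w2 : (C // r, C) weights of the gating layer,     b2 : (C,)
    """
    z = x.mean(axis=(0, 1))                       # squeeze: one value per channel
    h = np.maximum(z @ w1 + b1, 0.0)              # excitation, step 1: ReLU bottleneck
    s = 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))      # excitation, step 2: sigmoid gates
    return x * s, s                               # rescale each channel by its gate

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
C, r = 8, 4
x = rng.normal(size=(16, 16, C))
w1 = rng.normal(scale=0.1, size=(C, C // r))
b1 = np.zeros(C // r)
w2 = rng.normal(scale=0.1, size=(C // r, C))
b2 = np.zeros(C)

y, gates = squeeze_excite(x, w1, b1, w2, b2)
n_params = w1.size + b1.size + w2.size + b2.size
print("output shape:", y.shape, "extra parameters:", n_params)
```

The block adds only the two small weight matrices and their biases, which is why it increases the parameter count of the network very little.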
Our network was trained on 128 × 128 images, but our real images are larger than 128 pixels. Therefore, a large image must be divided into smaller patches before segmentation. Figure 2 depicts a patchified image ready for prediction. Table 1 presents the results on the train and test data sets. We obtained 69% IoU on the training data set and 65% on the test data set. In addition, we obtained 99% and 94% Precision for the train and test data sets, respectively. Figure 4 shows three samples from the test data set. The first and second columns show the test images and their corresponding annotated (ground truth) masks, respectively. The third column shows the predicted mask and the fourth column shows the image combined with the predicted mask. Figure 5 depicts the performance of the network on real images.
Although we trained our model on a large data set, we could not train for more than 500 epochs because of limited computing resources. Some pixels were recognised as cracks although they were not (false detections or false positives), such as tile intersections, black cables, window boundaries, and other similar features. Other crack pixels were not segmented by the network (false negatives). No method or model segments a feature with 100% accuracy, but a more accurate model is expected when the number of training images and epochs is increased. Moreover, the resolution of the training images affects the segmentation quality. We might get better results by training our model on 416 × 416 × 3 images instead of 128 × 128 × 3, since the larger patch gives the model more context to distinguish between crack and non-crack pixels during patch prediction. However, some of the aforementioned errors and weaknesses are unavoidable because of limited hardware such as GPU and RAM, and are not due to the network architecture.
Furthermore, adding a data set that is similar to our conditions and workspace helps to get better results. By training the network on a large data set and fine-tuning it with images taken under similar conditions, our model becomes better suited to our work. Although adding similar images yields better results, image annotation is a time-consuming and costly task. The use of an encoder-decoder network architecture, strong feature extractors as the backbone, and large data sets improves the flexibility and reliability of our work even though the model is trained for fewer epochs. The model is expected to become more accurate if it is trained for thousands of epochs while overfitting is prevented.
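The patch-based prediction used in this section (dividing a large image into 128 × 128 patches, predicting each patch, and reassembling the mask) can be sketched as follows. This is a minimal NumPy illustration, not the code used in this work: the helper names `patchify` and `unpatchify`, the edge padding, and the stand-in `dummy_predict` (a simple darkness test in place of the trained network) are assumptions.

```python
import numpy as np

def patchify(img, p=128):
    """Pad an (H, W[, C]) image to multiples of p and cut it into p x p patches."""
    h, w = img.shape[:2]
    pad = ((0, (-h) % p), (0, (-w) % p)) + ((0, 0),) * (img.ndim - 2)
    padded = np.pad(img, pad, mode="edge")
    rows, cols = padded.shape[0] // p, padded.shape[1] // p
    tail = img.shape[2:]
    patches = padded.reshape(rows, p, cols, p, *tail).swapaxes(1, 2)
    return patches.reshape(rows * cols, p, p, *tail), (rows, cols), (h, w)

def unpatchify(patches, grid, orig_hw):
    """Reassemble patches produced by patchify and crop away the padding."""
    rows, cols = grid
    p = patches.shape[1]
    tail = patches.shape[3:]
    full = patches.reshape(rows, cols, p, p, *tail).swapaxes(1, 2)
    full = full.reshape(rows * p, cols * p, *tail)
    return full[:orig_hw[0], :orig_hw[1]]

def dummy_predict(patches):
    """Stand-in for the trained network: dark pixels are labelled as crack."""
    return (patches.mean(axis=-1) < 60).astype(np.uint8)

# Hypothetical 300 x 500 RGB image with a dark horizontal line as a crack.
image = np.full((300, 500, 3), 200, dtype=np.uint8)
image[150, 50:450] = 20

patches, grid, hw = patchify(image)
mask = unpatchify(dummy_predict(patches), grid, hw)
print("patches:", patches.shape, "mask:", mask.shape, "crack pixels:", int(mask.sum()))
```

Because the padding is cropped again after reassembly, the predicted mask has the same height and width as the original image and can be placed directly on it.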

2.2 Reconstruction
To reconstruct the cracks, the SfM algorithm is applied, which consists of feature extraction, feature matching, bundle adjustment, and sparse/dense point cloud generation. Using only the masks predicted in the segmentation step as input failed because feature extraction and feature matching were imperfect on them. Furthermore, the lack of texture and the absence of a perfectly accurate segmentation network pose additional problems for the SfM step when only mask images are used as input.
To solve this problem, the segmented crack pixels are highlighted in red on the original images. Next, the original images and their corresponding segmented masks are used together in the point cloud generation procedure. The presence of high-quality textures and distinctive features enables the algorithm to reconstruct the whole scene and the cracks successfully. In this way, deficiencies of the network in the segmented masks do not disrupt the SfM, and cracks missed by the network are still present in the original images. In addition, crack pixels are emphasised and given more importance and attention. Next, statistical noise reduction is conducted on the sparse point cloud and points with high re-projection errors are removed. Afterwards, dense point cloud generation and texture mapping are carried out, in that order. Image segmentation followed by 3D reconstruction can be an alternative to 3D segmentation by deep learning methods. Figure 6 shows the point cloud reconstructed from images captured by a smartphone. Figure 7 depicts the 3D segmented crack points in the object space, highlighted in red in cropped sections. Although the point clouds are generated, they still need to be georeferenced. To solve the absolute orientation problem, 10 coded targets as GCPs and 2 scale bars are added. They allow the model space to be transformed into the real object space with local or world coordinates.
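The highlighting of segmented crack pixels on the original images, which is what the SfM step receives, can be sketched as below. This is a minimal illustration assuming an RGB image stored as a NumPy array and a binary mask of the same height and width; the function name `highlight_cracks` and the tiny test image are hypothetical.

```python
import numpy as np

def highlight_cracks(image, mask, color=(255, 0, 0)):
    """Return a copy of an (H, W, 3) RGB image with masked pixels painted in `color`."""
    out = image.copy()               # keep the original image untouched
    out[mask.astype(bool)] = color   # crack pixels become pure red by default
    return out

# Hypothetical 4 x 4 grey image whose second row was predicted as crack.
image = np.full((4, 4, 3), 120, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1, :] = 1

overlay = highlight_cracks(image, mask)
print(overlay[1, 0], overlay[0, 0])
```

All non-crack pixels keep their original texture, so feature extraction and matching still have the full image content to work with.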
The GCPs are measured with a Leica TC03 total station and later extracted from the captured images. To verify the accuracy, the GCPs and scale bars are divided into control and check sets. By solving the absolute orientation problem, the point cloud is brought into its true orientation, translation, and scale. Figure 8 shows the segmented point cloud in true position and scale. Finally, an orthophoto and a DEM are generated. The reconstruction error is the distance between the input (measured) and estimated positions of a marker, and the re-projection error is the distance between a marker's position in an image and the re-projection of its estimated 3D position. The root mean square error (RMSE) is calculated over all markers and over all images in which a marker is visible, where each marker is visible in at least two images. Table 2 reports the reconstruction and re-projection errors of the control and check points in millimetres and pixels, respectively. For the control points, the reconstruction error is 1.2 mm and the re-projection error is 1.031 pixels. For the check points, the values are 1.2 mm and 1.177 pixels.
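The absolute orientation step and the control/check evaluation can be illustrated with a closed-form estimate of the seven similarity parameters, X_object = s · R · x_model + t. The sketch below uses the SVD-based least-squares solution often attributed to Umeyama rather than the bundle adjustment used in this work, and all coordinates are synthetic: the 10 points, the split into 7 control and 3 check points, and the true parameters are assumptions for illustration only.

```python
import numpy as np

def estimate_similarity(src, dst):
    """Least-squares scale s, rotation R, translation t with dst ~ s * R @ src + t."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                 # cross-covariance of the two point sets
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                         # avoid a reflection
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / ((xs ** 2).sum() / len(src))
    t = mu_d - s * R @ mu_s
    return s, R, t

def apply_similarity(pts, s, R, t):
    return s * np.asarray(pts, float) @ R.T + t

def rmse(a, b):
    """Root mean square of the 3D distances between corresponding points."""
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))

# Synthetic model-space targets and their object-space coordinates (hypothetical).
rng = np.random.default_rng(1)
model_pts = rng.uniform(-1.0, 1.0, (10, 3))
a, b = 0.3, -0.2
Rz = np.array([[np.cos(a), -np.sin(a), 0.0], [np.sin(a), np.cos(a), 0.0], [0.0, 0.0, 1.0]])
Rx = np.array([[1.0, 0.0, 0.0], [0.0, np.cos(b), -np.sin(b)], [0.0, np.sin(b), np.cos(b)]])
R_true, s_true, t_true = Rz @ Rx, 2.5, np.array([10.0, -4.0, 1.5])
object_pts = s_true * model_pts @ R_true.T + t_true

control, check = slice(0, 7), slice(7, 10)     # assumed split of the 10 targets
s, R, t = estimate_similarity(model_pts[control], object_pts[control])
check_rmse = rmse(apply_similarity(model_pts[check], s, R, t), object_pts[check])
print("scale:", s, "check RMSE:", check_rmse)
```

With noise-free synthetic data the check RMSE is numerically zero; with real measurements it corresponds to the reconstruction error at the check points.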
For the control and check scale bars, precisions of 0.9 mm and 0.98 mm are obtained, respectively. This means that measuring distances with a precision better than one millimetre is feasible. The Ground Sampling Distance (GSD), defined as the distance on the object between the centres of two adjacent pixels, is 0.36 mm in this research. Considering the pixel as a square, each pixel in the image space is equivalent to 0.12 mm² in the object space.
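The relation between the GSD and the pixel footprint is simple arithmetic, sketched below. Using the rounded GSD of 0.36 mm gives a footprint of about 0.13 mm², of the same order as the 0.12 mm² quoted above; the small difference is presumably due to rounding of the GSD, which is an assumption on our part. The helper names and the 1 mm example crack are hypothetical.

```python
import math

GSD_MM = 0.36  # ground sampling distance reported in the text (rounded)

def pixel_footprint_mm2(gsd_mm):
    """Area covered by one square pixel in the object space."""
    return gsd_mm ** 2

def width_in_pixels(width_mm, gsd_mm=GSD_MM):
    """Number of pixels spanned by a feature of the given width."""
    return width_mm / gsd_mm

footprint = pixel_footprint_mm2(GSD_MM)
one_mm_crack = width_in_pixels(1.0)
print("pixel footprint [mm^2]:", round(footprint, 4))
print("pixels across a 1 mm wide crack:", round(one_mm_crack, 2))
```

A crack of 1 mm width therefore spans fewer than three pixels at this resolution, which gives a sense of how fine the detectable cracks are.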
The number of images is important in 3D reconstruction. Although increasing the number of images provides a larger coverage area and higher-quality model texture, it also increases the processing time and computation. Conversely, with fewer images, occluded areas remain. Furthermore, combining tilted images with stereo images reduces occluded areas by covering recessed and protruding parts of the surface. A dome-shaped imaging geometry with tilted images, shown in Figure 9, allows the tie points to be observed in several images and increases the overlap. As a result, details of specific features appear in the model and can be measured accurately. On the one hand, if the crack features are extracted from several images with different perspectives, the cracks are reconstructed, textured, and quantified more accurately and precisely, and occluded areas decrease as well. On the other hand, image resolution and quality play a significant role in precise 3D reconstruction. High-resolution, noise-free images taken under appropriate lighting conditions are also necessary. Using calibrated high-resolution metric cameras instead of non-metric low-cost sensors can improve the results of 3D reconstruction. In addition, systematic and random errors in land surveying must be checked and compensated. Calibrated measurement tools and expert operators reduce errors such as sighting or instrument levelling errors. It should be noted that the precision and accuracy of the absolute orientation parameters depend directly on the precision and accuracy of the target positions obtained by land surveying.

EXPERIMENTAL DETAIL
This research was conducted on a personal computer with an Intel Core i7-10750 CPU at 2.60 GHz, 16 GB of RAM, and a GeForce GTX 1660 Ti GPU. We used 8000 images of size 128 × 128, including all kinds of crack and non-crack scenes, for training and 2000 crack images from another data set for network validation. We used Intersection over Union (IoU), Recall, and Precision as metrics for both train and test evaluations, and Dice loss as the loss function. We trained our network for 500 epochs. For reconstruction, we added 33 images to the SfM network. The images were taken with a Samsung A-71 with a focal length of 5.23 mm and a resolution of 4624 × 3468 pixels.
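For reference, the evaluation metrics and the loss named above can be written out for binary masks as follows. This is a generic NumPy sketch using the standard definitions, not the training code of this work; the 4 × 4 example masks and the smoothing constant `eps` are assumptions.

```python
import numpy as np

def confusion_counts(pred, truth):
    """True positives, false positives and false negatives of two binary masks."""
    pred, truth = np.asarray(pred, bool), np.asarray(truth, bool)
    tp = int(np.sum(pred & truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    return tp, fp, fn

def iou(pred, truth):
    tp, fp, fn = confusion_counts(pred, truth)
    return tp / (tp + fp + fn) if (tp + fp + fn) else 0.0

def precision(pred, truth):
    tp, fp, _ = confusion_counts(pred, truth)
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(pred, truth):
    tp, _, fn = confusion_counts(pred, truth)
    return tp / (tp + fn) if (tp + fn) else 0.0

def dice_loss(prob, truth, eps=1e-6):
    """1 - Dice coefficient; `prob` may hold soft predictions in [0, 1]."""
    prob, truth = np.asarray(prob, float), np.asarray(truth, float)
    inter = float((prob * truth).sum())
    return 1.0 - (2.0 * inter + eps) / (float(prob.sum()) + float(truth.sum()) + eps)

# Hypothetical example: 4 true crack pixels, 3 of them found, 1 false detection.
truth = np.zeros((4, 4), dtype=np.uint8)
truth[1, :] = 1
pred = np.zeros((4, 4), dtype=np.uint8)
pred[1, :3] = 1
pred[2, 0] = 1

print("IoU:", iou(pred, truth))
print("Precision:", precision(pred, truth))
print("Recall:", recall(pred, truth))
print("Dice loss:", dice_loss(pred, truth))
```

In this example one false positive and one false negative already lower the IoU to 0.6 while Precision and Recall stay at 0.75, which illustrates why IoU is the stricter of the reported metrics.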

CONCLUSION
We proposed a framework to accurately reconstruct crack features in 3D space while decreasing the probability of missing cracks. First, crack pixels were segmented with Deeplabv3+, which achieved an IoU of 65% on the test data set. Next, the mask images as well as the original images were used as input for the reconstruction step. This helps to avoid missing important crack pixels by adding one level of emphasis. GCPs were added to transform points from the model space to the real object space in either local or global coordinates, to measure the crack size, and to assess the propagation of the cracks accurately. A reconstruction error of 1.2 mm was obtained for both control and check points, while the check scale bars reached a precision of 0.98 mm. With an imaging resolution of 0.12 mm², cracks in the range of this value were detectable and distinguishable. In order to reconstruct the crack features, the original images and mask images were added to the SfM procedure to obtain the 3D position of the cracks in the model space. It is expected that by increasing the accuracy and precision of the reconstruction of the whole scene, crack pixels will be reconstructed more accurately. Furthermore, there are many deficiencies in 3D reconstruction and 3D segmentation using deep learning approaches. Point clouds derived from deep learning methods are not accurate. In addition, they are not aligned in true position, orientation, and scale. Fortunately, 3D reconstruction by photogrammetric methods overcomes these deficiencies. Due to the lack of data sets for 3D segmentation and 3D reconstruction of cracks by deep learning methods, the use of artificial intelligence approaches is less feasible. With the framework proposed in this research, it is feasible to obtain segmented crack features in 3D space with true and accurate coordinates.

Figure 3. Architecture of network for crack segmentation.

Figure 4. Prediction of network on the test data set.

Figure 5. Prediction of network on real images.

Figure 6. Dense textured point cloud.

Figure 7. Segmented crack points in the point cloud.

Figure 8. Segmented point cloud including crack points in true coordinates.