AU-Net: A Deep Learning Network for Precise Water Body Extraction in The Middle And Lower Reaches of The Yellow River

Accurate and effective extraction of water body information is an important prerequisite for hydrological studies of the Yellow River. However, the water flow in the middle and lower reaches of the Yellow River is scattered and shifts frequently. Traditional water body extraction methods rely mainly on handcrafted statistical features, which cannot fully extract the river body under real-world conditions. To address these problems and achieve more accurate results, an AU-Net network is proposed that expands the receptive field of the convolutional kernel and incorporates the detailed information of multi-scale features, which improves the extraction of the middle and lower reaches of the Yellow River from remote sensing images. The experimental results show that, compared with the other methods, the AU-Net model achieves higher recognition accuracy (MPA = 0.99 and MIoU = 0.97) on the water body dataset of the middle and lower reaches of the Yellow River. The network is also robust and fits well, so it extracts the middle and lower reaches of the Yellow River more reliably.


INTRODUCTION
As the second-longest river in China, the Yellow River is regarded as the cradle of Chinese civilization, and its development has an important impact on the country's economy, agriculture, and water resources (Baoping & Yuanbo, 2021). However, soil erosion, water shortage, and flooding in the middle and lower reaches of the Yellow River have harmed the region's natural landscape, hindered the production and livelihood of local people, and negatively affected China's social and economic development. These issues must be addressed urgently. Accurate and effective extraction of water body information is therefore crucial for the protection and management of the Yellow River and for flood prevention and mitigation (HU & Zhang, 2018; Ting, Huaibao, Yuanjian, Kunpeng, & Weimin, 2019; Zhong et al., 2021). However, the middle and lower reaches are prone to siltation because of their high sediment content. The river and its branches shift frequently, producing a wide, shallow, braided channel with sandbars (Zhang, Shang, Cui, Luo, & Zhang, 2022; Zhong et al., 2021). Because of these complex river characteristics, it remains challenging to extract water bodies in the middle and lower reaches of the Yellow River accurately and effectively from remote sensing images.
Water body extraction has received considerable attention, and most work falls into classical approaches and deep learning-based methods. Classical methods are either unsupervised or supervised. Among the unsupervised methods, the water body index is one of the most popular: it examines the spectral characteristics of water bodies in each band and applies a threshold to separate water from non-water. Xu found that water bodies are more prominent in the mid-infrared band and proposed the MNDWI index, which has been widely used (Han-qiu, 2005); Feyisa et al. constructed an automatic water body extraction index that allows the simultaneous extraction of multiple water bodies (Feyisa, Meilby, Fensholt, & Proud, 2014). Supervised classification, on the other hand, uses various algorithms to find pixels with similar features based on spectral similarity and groups them into classes. Chen et al. considered the spectral and spatial characteristics of ground features and constructed a knowledge decision tree to extract urban water bodies (Jing-bo, xi, Cheng-yi, cheng, & Zhong-wu, 2013); Elmi et al. precisely tracked three different hydrological features of the Niger and Congo rivers using Markov random field methods (Elmi, Tourian, & Sneeuw, 2016). Although these methods extract water bodies fairly well from remote sensing images of simple, straight rivers, they rely heavily on manually selected samples and fine-tuned thresholds. Against the complicated background of the Yellow River's middle and lower reaches, they may fail to extract the river body accurately and suffer from unclear river boundaries, omissions, or misclassification.
With recent advances in deep learning, extracting water bodies from remote sensing images with deep learning (DL) methods has become popular. Yuan et al. proposed a deep convolutional encoder-decoder architecture for extracting water bodies from very high-resolution images (Yuan et al., 2021); Li et al. developed a DeepUnet model for automated sea-land segmentation, which removes the human influence of traditional methods and improves the accuracy of water body extraction against complex backgrounds (Li et al., 2018). As one of the most common encoder-decoder structures, the U-Net network proposed by Ronneberger et al. (Ronneberger, Fischer, & Brox, 2015) uses concatenation between encoder and decoder to fuse high-level information with shallow-level information. This avoids losing high-level semantic information and preserves image features as far as possible. Nevertheless, U-Net inevitably loses feature information over repeated sampling, which limits its performance in recognizing narrow rivers.
To address these issues, this study proposes a new network model, AU-Net, which introduces the Atrous Spatial Pyramid Pooling (ASPP) module into the original U-Net to better extract river features at different scales. The model expands the receptive field of the convolutional kernel, obtains river features at different scales through four parallel convolution layers, and then fuses them, which eases the difficulty of accurately extracting water bodies in the middle and lower reaches of the Yellow River.
The rest of the article is structured as follows. Section 2 describes the structure of the proposed network model. Section 3 presents and assesses the experimental results. Section 4 discusses the convergence of the network model and a comparative study of the NDWI method. Finally, we conclude the article in Section 5.

Overview of the AU-Net
The AU-Net network is built on the classic U-Net structure and can be divided into two main parts: the encoding network and the decoding network. The architecture is shown in Fig. 1. Each downsampling stage of the encoder uses two identical 3 × 3 convolutional layers for feature extraction, each followed by a ReLU activation, and the feature map is then downsampled by a 2 × 2 max pooling operation. The ASPP module is introduced before the feature maps are upsampled.
The feature maps at this stage contain rich spatial and semantic information, and the ASPP fuses deeper image detail so that the network extracts highly correlated river detail features in the encoding structure and passes them to the decoding structure, improving the extraction accuracy of river water body information. The decoder's deconvolution layers upsample the feature maps; each upsampling stage stacks two convolutional layers and fuses features from different levels through a skip connection, gradually recovering the feature map so that the output image has the same size as the input image. The network parameter settings and the feature map changes of each layer are shown in Table 1.
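The encoder-decoder flow described above can be traced with a small shape calculation. The sketch below is illustrative only: the number of stages (4), the base channel count (64), and the use of padded 3 × 3 convolutions are assumptions borrowed from the common U-Net layout, not values taken from Table 1, and `trace_shapes` is a hypothetical helper.

```python
def trace_shapes(size=512, base_channels=64, stages=4):
    """Return (stage name, channels, height/width) for a U-Net-style network.

    Assumes padded 3x3 convolutions (spatial size unchanged), 2x2 max pooling
    (size halved), and channel doubling per encoder stage. These are
    assumptions for illustration, not the paper's Table 1.
    """
    shapes = []
    channels, hw = base_channels, size
    for stage in range(1, stages + 1):
        shapes.append((f"enc{stage}", channels, hw))  # two 3x3 convs + ReLU
        hw //= 2                                      # 2x2 max pooling
        channels *= 2
    shapes.append(("bottleneck+ASPP", channels, hw))  # ASPP before upsampling
    for stage in range(stages, 0, -1):
        hw *= 2                                       # deconvolution upsamples
        channels //= 2
        shapes.append((f"dec{stage}", channels, hw))  # skip connection + convs
    return shapes


for name, channels, hw in trace_shapes():
    print(f"{name:16s} {channels:5d} x {hw} x {hw}")
```

Under these assumptions a 512 × 512 input is reduced to 32 × 32 at the bottleneck and restored to 512 × 512 at the last decoder stage, which matches the statement that the input and output images have the same size.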

Parameter setting for the U-net section
U-Net is a semantic segmentation network with a classical encoder-decoder structure. Because of its fully symmetrical structure it is flexible, transferable, and generalizable, and it is compatible with a wide range of optimization strategies. It has a contracting path that captures contextual information and a symmetric expanding path that enables precise localization, which allows the network to propagate contextual information to higher-resolution layers. The skip connection is a key module of the U-Net architecture: it transfers feature maps from each encoder stage to the corresponding decoder stage. By fusing shallow, detailed features with deeper, abstract semantic features through skip connections, the network balances semantic features of different sizes and depths.
The structure of the U-Net network is shown in Fig. 2. The encoder performs downsampling with convolutional layers and max pooling layers, which extract feature information from the image. The decoder upsamples by deconvolution and, through skip connections, restores the image to the input size.
The Zhengzhou section of the Yellow River is complex and variable, with a large number of mid-channel sandbars, small branching streams, and seasonal rivers. Using the U-Net network directly therefore leads to poor extraction of river edges and branching streams and leaves some water bodies unextracted.

ASPP model setup
The ASPP (Atrous Spatial Pyramid Pooling) module combines atrous convolution with a spatial pyramid pooling structure and captures multi-scale semantic information through atrous convolutions with different atrous rates (Chen, Papandreou, Kokkinos, Murphy, & Yuille, 2018). This study introduces ASPP into the U-Net network model for three reasons. First, it enlarges the receptive field of the convolution kernel; compared with the enlargement obtained by pooling, it does less damage to spatial resolution and does not alter the relative positions of pixels. Second, to capture multi-scale contextual information, ASPP applies branches with different atrous rates in parallel, and their different receptive fields provide both global and local multi-scale information. Third, ASPP limits the computational cost: an atrous convolution requires no additional parameters compared with an ordinary convolution of the same kernel size, and the zero-valued positions between kernel taps are not computed during the convolution.
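To make the receptive-field argument concrete, the following sketch computes the span covered by a k-tap kernel at atrous rate r, which is k + (k - 1)(r - 1), and runs the same kernel at several rates in parallel on a 1-D signal. The rates (1, 6, 12, 18) are an assumption following common ASPP configurations, since this section does not state the rates used in AU-Net, and all function names are hypothetical.

```python
def effective_kernel(k, rate):
    """Span (in samples) covered by a k-tap kernel with the given atrous rate."""
    return k + (k - 1) * (rate - 1)


def atrous_conv1d(signal, kernel, rate):
    """Zero-padded, same-length 1-D atrous convolution (kernel not flipped)."""
    center = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for t, weight in enumerate(kernel):
            j = i + (t - center) * rate   # taps sit `rate` samples apart
            if 0 <= j < len(signal):      # taps outside the signal read as zero
                acc += weight * signal[j]
        out.append(acc)
    return out


def aspp_branches(signal, kernel, rates=(1, 6, 12, 18)):
    """Apply one kernel at several rates in parallel, one output per rate.

    A real ASPP module would concatenate the branch outputs along the channel
    axis and fuse them with a 1x1 convolution; that step is omitted here.
    """
    return [atrous_conv1d(signal, kernel, rate) for rate in rates]


for rate in (1, 6, 12, 18):
    print(rate, effective_kernel(3, rate))
```

A 3-tap kernel at rate 6 spans 13 samples while still using only 3 weights, which is the sense in which the receptive field grows without extra parameters.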

Activation functions
In a multilayer neural network, the output of the neurons in one layer is related to the input of the neurons in the next layer by a function, which is called the activation function. This paper uses the ReLU (Rectified Linear Unit) activation function, which is commonly used in neural networks. Its expression is as follows:

$$f(x) = \max(0, x)$$

There are three reasons for using ReLU. First, ReLU does not saturate for positive inputs, which mitigates the vanishing gradient problem. Second, ReLU sets the output of some neurons to 0, which reduces the interdependence of parameters, makes the network sparse, and alleviates overfitting. Third, ReLU involves no complex exponential operations, so it is simple and efficient to compute.
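A minimal sketch of ReLU and its gradient illustrates the sparsity and non-saturation points made above. The sample input values are hypothetical and chosen only for illustration.

```python
def relu(x):
    """Rectified Linear Unit: passes positive inputs, zeroes the rest."""
    return max(0.0, x)


def relu_grad(x):
    """Gradient of ReLU: constant 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


# Hypothetical pre-activation values.
inputs = (-2.0, -0.5, 0.0, 1.5, 3.0)
activations = [relu(v) for v in inputs]

# Fraction of neurons whose output is exactly zero (network sparsity).
sparsity = activations.count(0.0) / len(activations)
print(activations, sparsity)
```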

Loss functions
The loss function is a non-negative real-valued function that measures the difference between the predicted and true values of a model. The smaller the loss, the better the robustness of the model. The softmax function converts the output values of a multi-class model into a probability distribution over the range [0, 1] and is often used for feature separation problems in multi-class classification and image labeling.
The specific expression is as follows:

$$\mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{c=1}^{C} e^{z_c}}$$

where $z_i$ is the output value of the $i$th node and $C$ is the number of output nodes, i.e. the number of categories in the classification.
The softmax loss function separates classes well and is effective at optimizing inter-class distances. The exponential function in softmax is easy to differentiate when gradients are computed for parameter updates during back-propagation, and it also amplifies large gaps between output values.
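The softmax expression can be sketched in a few lines. Pairing softmax with the negative log-likelihood (cross-entropy) is an assumption here, based on the usual meaning of "softmax loss"; the section does not spell out the loss formula. Subtracting the maximum before exponentiation is a standard numerical-stability step that does not change the result.

```python
import math


def softmax(z):
    """Convert raw outputs z_1..z_C into probabilities that sum to 1."""
    shift = max(z)                            # stability: avoids overflow in exp
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(z, target):
    """Cross-entropy of the softmax output for the true class index (assumed form)."""
    return -math.log(softmax(z)[target])


# Hypothetical two-class (non-water, water) outputs for one pixel.
probs = softmax([0.5, 2.5])
print(probs, softmax_loss([0.5, 2.5], target=1))
```

Because of the exponential, the gap of 2.0 between the two outputs becomes a probability ratio of about e^2, which is the amplification effect noted above.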

Study area and data set
With the rapid development of deep learning, various datasets have been built for different ground features, but few are dedicated to water bodies. This paper selects the Zhengzhou section of the Yellow River (34°49'-34°59'N, 112°42'-114°17'E), with a total length of about 185 km, as the study area. The scope of the study area is shown in Fig. 4. This reach has a wide, shallow channel; its macrochannel is commonly 1.5 to 10 km wide and includes sandbars, forks, and tributaries. The Zhengzhou section, in Henan Province, China, shows all the characteristics of the middle and lower reaches and is a representative wandering segment of those regions. A total of 30 remote sensing images were chosen to create the dataset. The images are Landsat 7 and Landsat 8 scenes obtained from the United States Geological Survey (USGS). Samples were selected by manual visual interpretation. Labelling accuracy is essential because convolutional neural networks learn pixel by pixel and the final results depend on pixel-level labels. The study combines NDWI with manual delineation to produce binary label maps with values in {0, 1}. The dataset for the Yellow River's middle and lower reaches consists of 900 pairs of remote sensing images and label maps, each with a spatial resolution of 30 m and a size of 512 × 512 pixels. In our experiments, the samples are divided into training, validation, and test sets at a ratio of 8:1:1. The dataset covers a variety of characteristic river types in the middle and lower reaches of the Yellow River.
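The 8:1:1 division of the 900 samples can be sketched as follows. Random shuffling with a fixed seed is an assumption, since the paper does not say how samples were assigned to each set, and the tile names are hypothetical placeholders.

```python
import random


def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle items and split them into train/validation/test by ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)        # assumption: random assignment
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]            # remainder goes to the test set
    return train, val, test


# Hypothetical identifiers for the 900 image/label pairs.
pairs = [f"tile_{i:03d}" for i in range(900)]
train, val, test = split_dataset(pairs)
print(len(train), len(val), len(test))
```

With 900 pairs this gives 720 training, 90 validation, and 90 test samples.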

Network parameter settings
The following parameters were used during training. The experiments used an SGD optimizer with a momentum of 0.9 and an initial learning rate of 0.0001; if the loss did not decline after 100,000 iterations of training, the learning rate was reduced to 0.1 times the original value, with a minimum learning rate of 0.00001. The weight decay was 0.0005, and the number of training iterations was 200,000. The input data were normalized with a mean of 0.5 and a standard deviation of 0.5.
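The learning-rate rule and the normalization above can be sketched in plain Python. Interpreting "did not decline after 100,000 iterations" as a patience counter is an assumption, as is the premise that pixel values are scaled to [0, 1] before normalization; `PlateauLR` and `normalize` are hypothetical names, not part of any library.

```python
class PlateauLR:
    """Reduce the learning rate when the loss stops improving (sketch)."""

    def __init__(self, lr=1e-4, factor=0.1, patience=100_000, min_lr=1e-5):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.stale = 0  # iterations since the loss last improved

    def step(self, loss):
        """Record one iteration's loss and return the learning rate to use."""
        if loss < self.best:
            self.best = loss
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                # Scale by `factor`, but never go below the minimum rate.
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.stale = 0
        return self.lr


def normalize(pixel, mean=0.5, std=0.5):
    """Map a pixel value in [0, 1] to [-1, 1] (assumes inputs scaled to [0, 1])."""
    return (pixel - mean) / std
```

With the stated settings the rate can drop only once, from 0.0001 to the floor of 0.00001, because 0.1 × 0.0001 already equals the minimum.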

Design of evaluation indicators
Extracting water bodies from remote sensing images can be regarded as a semantic segmentation task in which water body pixels are positive samples and all other pixels are negative samples. All prediction results therefore fall into four groups: True Positive (TP) is the number of water body pixels correctly identified; True Negative (TN) is the number of other pixels correctly identified; False Positive (FP) is the number of other pixels misclassified as water body pixels; and False Negative (FN) is the number of water body pixels misclassified as other pixels.
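To evaluate the performance of the proposed method, we use two typical metrics: Mean Pixel Accuracy (MPA) and Mean Intersection over Union (MIoU). MPA computes the proportion of correctly classified pixels for each category and averages it over the categories. MIoU computes, for each category, the ratio between the intersection and the union of the predicted and true pixel sets, i.e. the overlap rate between predicted and true water bodies, and averages it over the categories. For both metrics, a higher value means higher accuracy. The formulae, written here in their standard form, are as follows:

$$\mathrm{MPA} = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij}}$$

$$\mathrm{MIoU} = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$$

where $k+1$ is the number of categories, $p_{ij}$ is the number of pixels of category $i$ predicted as category $j$ (for $i \neq j$, the false negatives of category $i$), and $p_{ii}$ is the number of correctly classified pixels of category $i$.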
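As a worked illustration of the two metrics, the sketch below builds a confusion matrix whose entry `cm[i][j]` counts pixels of true class i predicted as class j, then averages per-class accuracy and per-class IoU. It uses the standard definitions of MPA and MIoU; the eight-pixel example is hypothetical, and the code assumes every class occurs at least once in the ground truth.

```python
def confusion_matrix(y_true, y_pred, num_classes=2):
    """cm[i][j] = number of pixels of true class i predicted as class j."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def mean_pixel_accuracy(cm):
    """Average over classes of (correct pixels of class i) / (all pixels of class i)."""
    n = len(cm)
    return sum(cm[i][i] / sum(cm[i]) for i in range(n)) / n


def mean_iou(cm):
    """Average over classes of intersection / union."""
    n = len(cm)
    ious = []
    for i in range(n):
        true_i = sum(cm[i])                       # pixels labelled class i
        pred_i = sum(cm[j][i] for j in range(n))  # pixels predicted as class i
        ious.append(cm[i][i] / (true_i + pred_i - cm[i][i]))
    return sum(ious) / n


# Hypothetical flattened label and prediction maps (1 = water, 0 = other).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
cm = confusion_matrix(y_true, y_pred)
print(cm, mean_pixel_accuracy(cm), mean_iou(cm))
```

For this example the per-class accuracies are 4/5 and 2/3 and the per-class IoUs are 4/6 and 2/4, so one missed water pixel and one false alarm lower MIoU more than MPA.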

Experimental results
After the network model is trained, the test set is used for prediction with an input image size of 512 × 512. In the experiment, U-Net and AU-Net are compared under the same operating environment and identical parameter settings. The NDWI water index method represents the traditional extraction methods, and its results at thresholds of 0.04 and 0.05 were selected for comparison.

Fig. 5. The results of river extraction
The quantitative results for the MPA and MIoU of all methods are summarized in Table 2. As Table 2 shows, the proposed model is substantially better than the other three methods in both MPA and MIoU, with values of 0.992 and 0.972 respectively. The MPA and MIoU of the proposed network are about 0.7 higher than those of the conventional extraction method, and 0.093 and 0.091 higher than those of U-Net, respectively. Fig. 5 gives a visual comparison of each method on the dataset of the middle and lower reaches of the Yellow River, where columns a, b, c, d, and e show five river reaches with different hydrological characteristics:
(a) A simple section with a smooth river and no sandbars or forks, but with a narrow channel and a relatively small area.
(b) A morphologically complex section with numerous dense mid-channel sandbars, distinct braided flows, and branching flows that are dispersed yet continuous. It is typical of rivers in the middle and lower sections of the Yellow River.
(c) A section where other nearby water bodies appear in the background.
(d) A section with two rivers in the image: the Yellow River, which is extracted in this study, and the Yiluo River, a tributary of the Yellow River, visible in the lower right corner.
(e) A section with a comprehensive background that combines the four backgrounds above.
Applied to the five river segments above, the NDWI method gives similar results across segments. At a threshold of 0.04, the river is complete and coherent, and tributaries, narrow rivers, and branching streams are extracted well, but non-water features are also extracted incorrectly, which results in a high FP. At a threshold of 0.05, the mixed-in features are significantly reduced and the river can be seen, but the main river is discontinuous and tributaries, narrow rivers, and branching streams are not extracted. The overall river information is therefore incomplete, which leads to a high FN. These problems give the NDWI method a low MPA and MIoU.
The U-Net network model identifies the larger, wider mainstem of the middle and lower reaches of the Yellow River (columns a and d of Fig. 5) and accurately distinguishes water bodies from non-water features. Its MPA and MIoU therefore reach 0.896 and 0.880 respectively, higher than those of the NDWI method. However, because U-Net cannot capture multi-scale feature information, its extraction of branching streams, mid-channel sandbars, and fine tributaries of the Yellow River is only moderate (column b of Fig. 5). In addition, U-Net cannot distinguish the river from other water bodies (column c of Fig. 5). The AU-Net network not only extracts the main rivers clearly and continuously but also identifies forks and tiny rivers, which reduces FP and FN and raises MPA and MIoU by about 0.09 compared with U-Net. This suggests that adding the ASPP module helps capture multi-scale water body information and improves the accuracy of semantic segmentation.
In summary, the AU-Net network structure reduces incomplete prediction and misclassification of small rivers in the middle and lower reaches of the Yellow River and produces a relatively high-quality predicted water body map. Interruptions in the larger rivers are reduced, and the predicted water bodies are more complete and more accurate, with clearer boundaries.

Network model convergence
Network convergence means that the model is stable: when a weight parameter of the model changes slightly, the output of the model does not change strongly, and the result is more stable. Fig. 6 shows the loss curves of the AU-Net and U-Net network models. A smaller loss value indicates a better fit, and a smoother loss curve indicates a more robust model. Compared with U-Net, the AU-Net loss curve is smoother and its loss values are smaller and more consistent. As the graph shows, during training the U-Net loss mainly fluctuates within the interval [0, 0.04] and has still not converged after 200,000 iterations, whereas the AU-Net loss fluctuates around 0.01 from 25,000 iterations and converges at 125,000 iterations. To sum up, the loss curve of the AU-Net network model reaches a stable value earlier than that of U-Net and is smoother; that is, AU-Net converges more quickly than U-Net and has a better fit and higher robustness.

Threshold selection of the NDWI method
The NDWI method is a widely used traditional water body extraction method. Its performance relies heavily on the choice of segmentation threshold. In this part, we further examine the NDWI performance by varying the threshold from 0 to 0.05 in steps of 0.01. The results are shown in Fig. 7. At a threshold of 0.05, the MPA and MIoU values were highest: 0.298 and 0.280 above those at a threshold of 0.03, and 0.095 and 0.108 above those at a threshold of 0.04. However, at a threshold of 0.05, the river was discontinuous and no branch or tributary streams were extracted. When the threshold is less than 0.05, the river is intact, but non-water features are mixed in and the river cannot be clearly identified. In summary, extracting the middle and lower reaches of the Yellow River with the traditional water body extraction method requires a very precise choice of threshold, which has to be determined through repeated experiments. Because of the complexity of the ground features and the influence of the spectrum, the results obtained by the traditional method do not meet the needs of the study.
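The threshold sensitivity of NDWI can be illustrated with a small sweep. The sketch uses the standard definition NDWI = (Green - NIR) / (Green + NIR), which this paper does not spell out, and a pixel is marked as water when its NDWI exceeds the threshold. The three reflectance values are hypothetical: a clear water pixel, a land pixel, and a mixed pixel such as a shallow or sediment-laden margin.

```python
def ndwi(green, nir):
    """Normalized Difference Water Index for one pixel.

    Standard definition; for Landsat 8 OLI the green and NIR bands are
    bands 3 and 5, and for Landsat 7 ETM+ they are bands 2 and 4.
    """
    total = green + nir
    return 0.0 if total == 0 else (green - nir) / total


def water_mask(green_band, nir_band, threshold):
    """Binary mask: 1 where NDWI exceeds the threshold, else 0."""
    return [
        [1 if ndwi(g, n) > threshold else 0 for g, n in zip(g_row, n_row)]
        for g_row, n_row in zip(green_band, nir_band)
    ]


# Hypothetical reflectances: clear water, land, mixed pixel.
green = [[0.10, 0.10, 0.105]]
nir = [[0.05, 0.30, 0.100]]

# Thresholds from 0 to 0.05 in steps of 0.01, as in the experiment.
thresholds = [round(0.01 * i, 2) for i in range(6)]
for t in thresholds:
    print(t, water_mask(green, nir, t))
```

In this toy case the mixed pixel has an NDWI of about 0.024, so it is kept as water at low thresholds and dropped at higher ones, which mirrors the trade-off between mixed-in non-water features and a discontinuous river.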

CONCLUSIONS AND FUTURE WORK
To obtain more accurate segmentation of the complex channels of the Yellow River, this paper proposed the AU-Net network for automatic river extraction in the middle and lower reaches of the Yellow River. AU-Net is built on the U-Net network, and the ASPP module is introduced to address U-Net's lack of multi-scale features and to improve the accuracy of semantic segmentation. On the dataset of the middle and lower reaches of the Yellow River, the AU-Net network model identifies water bodies accurately and completely, and it was compared with the NDWI method and the U-Net network model. The experimental results show that the rivers extracted by AU-Net are more complete and more accurate, with clear boundaries, and that the ASPP module effectively captures multi-scale semantic information, thereby improving the accuracy of semantic segmentation.
It should be noted that the dataset includes only water bodies in the middle and lower reaches of the Yellow River, so it may not fully represent the characteristics of water bodies in other parts of the Yellow River. In addition, the AU-Net network model performs well on braided rivers, but its ability to extract other water bodies such as lakes and oceans requires further verification. In future work, we aim to expand the training data and to collect remote sensing imagery for multiple types of water bodies, so that the AU-Net model can accurately extract various types of water bodies in different scenarios.

Fig. 1. Overview of the AU-Net network architecture

Fig. 6. Loss of network training stage

Fig. 7. Results of the NDWI method at different thresholds

Table 1
Settings of each layer of the network

Table 2
Evaluation accuracy of the network model