SEMANTIC SEGMENTATION OF FOREST STANDS OF PURE SPECIES AS A GLOBAL OPTIMIZATION PROBLEM

Forest stand delineation is a fundamental task for forest management purposes, that is still mainly manually performed through visual inspection of geospatial (very) high spatial resolution images. Stand detection has been barely addressed in the literature which has mainly focused, in forested environments, on individual tree extraction and tree species classification. From a methodological point of view, stand detection can be considered as a semantic segmentation problem. It offers two advantages. First, one can retrieve the dominant tree species per segment. Secondly, one can benefit from existing low-level tree species label maps from the literature as a basis for high-level object extraction. Thus, the semantic segmentation issue becomes a regularization issue in a weakly structured environment and can be formulated in an energetical framework. This papers aims at investigating which regularization strategies of the literature are the most adapted to delineate and classify forest stands of pure species. Both airborne lidar point clouds and multispectral very high spatial resolution images are integrated for that purpose. The local methods (such as filtering and probabilistic relaxation) are not adapted for such problem since the increase of the classification accuracy is below 5%. The global methods, based on an energy model, tend to be more efficient with an accuracy gain up to 15%. The segmentation results using such models have an accuracy ranging from 96% to 99%.


INTRODUCTION
The analysis of forested areas from a remote sensing point of view can be performed at three different levels: pixel, object (mainly trees) or stand.When a joint mapping and statistical reasoning is required (e.g., land-cover (LC) mapping and forest inventory), forest stands remain the prevailing scale of analysis (Means et al., 2000, White et al., 2016).A stand can be defined in many different ways in terms of homogeneity: tree specie, age, height, maturity, and its definition varies according to the countries.Most of the time in national forest inventories, for reliability purposes, each area is manually interpreted by human operators using very high resolution (VHR) geospatial images with a near infra-red channel (Kangas and Maltamo, 2006).Among the large body of available remote sensing data today, airborne laser scanning (ALS) and Very High spatial Resolution (VHR) hyper/multispectral images are both well adapted and complementary inputs for stand segmentation (Dalponte et al., 2012, Dalponte et al., 2015, Lee et al., 2016).ALS provides a direct access to the vertical distribution of the trees and to the ground underneath.Hyperspectral and multispectral images are particularly relevant for tree species classification: spectral and textural information from VHR images can allow a fine discrimination of many species, respectively.Multispectral images are often preferred due to their higher availability, and higher spatial resolution.One should note that the literature remains focused on individual tree extraction and tree species classification, developing site-specific workflows with similar advantages, drawbacks and classification performance.Consequently, no operational framework embedding the automatic analysis of remote sensing data has been yet proposed in the literature for forest stand segmentation (Dechesne et al., 2017).More surprisingly, only few methods have addressed such an issue from a research perspective.More authors have focused on forest delineation (Eysn et al., 2012), that do not convey information about the tree species and their spatial distribution.
The analysis of the lidar and multispectral data is performed at three levels in (Tiede et al., 2004), following the hierarchy of the nomenclature of forest LC species database.The multi-scale analysis offers the advantage of alleviating the standard limitations of individual tree crown detection, and of retrieving labels related here to forest development stage.Nevertheless, the pipeline is highly heuristic and under-exploits lidar data.Besides, significant confusions between classes are reported.The automatic segmentation of forests in (Diedershagen et al., 2004) is also performed with lidar and VHR multispectral images.The idea is to divide the forests into higher and lower sections according to the height provided by the lidar sensor.An unsupervised classification process is applied and pre-defined thresholds enable to obtain the desired delineation of stands.This method is efficient if the canopy structure is homogeneous and requires a strong knowledge on the area of interest.Based on height information only, it cannot differentiate two stands of similar height but different species.In (Leppänen et al., 2008), a stand segmentation technique for a forest composed of Scots Pine, Norway Spruce and Hardwood is defined.A hierarchical segmentation on the Crown Height Model followed by a region growing procedure is performed on images composed of rasterized lidar data and Colored Infra-Red images.The process was only applied on a limited area of Finland, preventing from drawing broad conclusions.However, the quantitative analysis enhances again that lidar data can help to define statistically meaningful stands and that multispectral images are inevitable inputs for tree species discrimination.
Eventually, in (Dechesne et al., 2017), forest stand segmentation is considered from semantic segmentation point of view.Forest areas are first classified according to the tree species at the pixel and tree levels using lidar and multispectral airborne images.Then, the label map is smoothed using an energetic framework that integrates both lidar and optical features.
In this paper, we specifically focus on semantic segmentation of forest stands through the regularization/smoothing process of an existing label map of pure species, following the strategy proposed in (Dechesne et al., 2017).Therefore, we build upon the vast amount of literature dealing with tree species classification at the tree level and investigate how the combined use of airborne lidar and VHR multispectral images can provide more accurate label maps.Simple smoothing methods are first investigated as well as more complex energy formulations.We aim to determine whether a complex formulation of the problem helps to obtain better results in such non-structured environments.

How to smooth a label map?
Pixel-wise classification is not sufficient for both accurate and smooth land-cover mapping with VHR remote sensing data.This is particularly true in forested areas: the large intra-class and low inter-class variabilities of classes result in noisy label maps at pixel or tree levels.This is why various regularization solutions can be adopted from the literature (from simple smoothing to probabilistic graphical models, see Section 2.1).According to (Schindler, 2012), both local and global methods can provide a regularization framework, with their own advantages and drawbacks.In local methods, the neighborhood of each element is analyzed by a filtering technique.The labels of the neighboring pixels (or the posterior class probabilities) are combined so as to derive a new label for the central pixel.Majority voting, Gaussian and bilateral filtering can be employed if it is not targeted to smooth class edges.Global methods consider the full area of interest at the same time.They are based on Markov Random Fields (MRF), the labels at different locations are not considered to be independent.The optimal configuration of labels is retrieved when finding the Maximum A Posteriori over the entire field (Moser et al., 2013).The problem is therefore considered as the minimization procedure of an energy E over the full image I. Despite a simple neighborhood encoding (pairwise relations are often preferred), the optimization procedure propagates over large distances.Depending on the formulation of the energy, the global minimum may be reachable.However, a large range of optimization techniques allow to reach local minima close to the real solution, in particular for random fields with pairwise terms (Kolmogorov and Zabih, 2004).For genuine structured predictions, in the family of graphical probabilistic models, Conditional Random Fields (CRF) have been massively adopted during the last decade.Interactions between neighboring objects, and subsequently the local context can be modelled and learned.In particular, Discriminative Random Fields (DRF, (Kumar and Hebert, 2006)) are CRF defined over 2D regular grids, and both unary/association and binary/interaction potentials are based on labelling procedure outputs.Many techniques extending this concept or focusing on the learning or inference steps have been proposed in the literature (Kohli et al., 2009, Ladický et al., 2012).A very recent trend even consists in jointly considering CRF and deep-learning techniques for the labelling task (Kirillov et al., 2015).In standard LC classification tasks, global methods are known to provide significantly more accurate results (Schindler, 2012) since contextual knowledge is integrated.This is all the more true for VHR remote sensing data, especially in case of a large number of classes (e.g., 10, (Albert et al., 2016)), but presents two disadvantages.For large datasets, their learning and inference steps are expensive to compute.Furthermore, parameters should often be carefully chosen for optimal performance, and authors that managed to alleviate the latter problem still report a significant computation cost (Lucchi et al., 2011).

Semantic segmentation is a suitable solution
The classification process can be eased with segmentation techniques.Such algorithms provide strong local spatial supports (namely superpixels), sometimes at various scales (Lucchi et al., 2011).This is the so-called Object-Based Image Analysis (OBIA) framework.A pure bottom-up process is however not sufficient in our case.Alternatively, it can be achieved with more sophisticated top-down processes, e.g., based on pattern recognition methods but, emphasis is then put on localization of the objects of interest instead of sharp boundary retrieval (i.e., the reverse advantage of per-pixel classification).The best of both worlds is obtained with semantic segmentation, which aims to solve the interleaved issue of classification and segmentation by combining top-down and bottom-up cues.It defines the task of partitioning an image into regions that delineate meaningful objects and labelling those regions with an object label.While it is very popular in computer vision (Ladický et al., 2010, Boix et al., 2011, Arbeláez et al., 2012, Chen et al., 2015), it has been barely addressed in the remote sensing community (Montoya-Zegarra et al., 2015, Zheng andWang, 2015).Segmentation segmentation frameworks have demonstrated their usefulness in particular in structured environments such as urban areas.Emphasis is put on context learning in (Volpi and Ferrari, 2015) and on the design of robust yet locally discriminative modelling strategy for urban area classification of VHR multispectral images.It is based on a flexible energetical framework, namely a CRF.The adoption of fully-connected CRF can even allow to learn longer class interactions such as shown in (Li and Yang, 2016).Finally, semantic segmentation can be achieved using Deep Neural Networks, assuming the standard procedure is accompanied with proper deconvolution steps or with a Fully Connected Network such as in (Marmanis et al., 2016).In forested areas, the combined use of airborne lidar (for height structure) and VHR multispectral images (for species composition) into such a smoothing process would allow (i) to retrieve homogeneous patches, (ii) to define the homogeneity criterion/criteria and (iii) to control the level of generalization of the final label map.

General strategy
The proposed method assumes that a label map is provided for the areas of interest, and is accompanied with a class membership probability map, which provides, for each pixel of the image, the posterior class membership for all classes of interest.These are the necessary inputs for all methods described below.In practice, the strategy proposed in (Dechesne et al., 2017) is as followed: a supervised classification is performed on a selection of features extracted both from 3D lidar point clouds and aerial multispectral images.The training pixels are selected according to an existing forest LC geodatabase.The used classifier is the Random Forest (RF) classifier.This is an efficient classifier, that directly handles multiple classes, and provides posterior probabilities for each class.Here, both local and global methods are tested.For local techniques, majority voting and probabilistic relaxation are selected (Sections 3.2 and 3.3).For global methods, various energy formulations based on a feature-sensitive Potts model are proposed (Section 3.4).

Filtering
An easy way to smooth a probability map is to filter it.All the pixels in a r × r pixels moving window W are combined in order to generate an output label of the central pixel.The most popular filter is the majority filter.Firstly, the class probabilities are converted into labels, assuming that the label of pixel x is the label of the most probable class. (1) where nc is the number of classes.From this label image, the final smoothed result is obtained by taking the majority vote in a local neighborhood.
Many other filters have been developed but are not investigated in this paper.

Probabilistic relaxation
The probabilistic relaxation aims at homogenizing probabilities of a pixel according to its neighboring pixels.The relaxation is an iterative algorithm in which the probability at each pixel is updated at each iteration in order to have it closer to the probabilities of its neighbors (Gong and Howarth, 1989).It was adopted for simplicity reasons.First, good accuracies are reported with decent computing time, which is beneficial over large scales.Secondly, it offers an alternative to edge aware/gradient-based techniques that may not be adapted in semantically unstructured environments like forests.The probability P t k (u) of class k at a pixel u at the iteration t is defined by δP t k (u) which depends on: • The distance du,v between the pixel u and its neighbors v (the pixels that are distant of less than r pixels from u).
• A co-occurrence matrix T k,l defining a priori correlation between the probabilities of neighboring pixels.The local cooccurrence matrix has been tuned arbitrarily, but can also be estimated using training pixels (Volpi and Ferrari, 2015).The matrix is expressed as follow: The update factor is then defined as: In order to keep the probabilities normalized, the update is performed in two steps using the unnormalized probability Q t+1 k (u) of class k at a pixel u at the iteration t + 1: (5)

Global smoothing
The global smoothing method uses only a small number of pairwise cliques between neighboring pixels (4-neighbors or 8-neighbors) to describe the smoothness.Over the entire resulting first order random fields, the maximization of the posterior probability leads to a smoothed results.This can be done by finding the minimum of the negative log-likelihood, arg min where P (u) = [P (u, ci)|P (u, ci) ≥ P (u, cj)∀j], A(u) are the values of the features at pixel u (such as height, reflectance...) and Nu is the 8-connected neighborhood of the pixel u (only the 8-connected neighborhood is investigated in this paper).When γ = 0, the pairwise term has no effect in the energy formulation; the most probable class is attributed to the pixel, leading to the same result as the classification output.When γ = 0, the resulting label map becomes more homogeneous, and the borders of the segments/stands are smoother.However, if γ is too high, the small areas are bound to be merged into larger areas, removing a part of the useful information provided by the classification step.
The automatic tuning of the parameter γ has been addressed in (Moser et al., 2013) but is not used here.In this paper, two formulations of Edata (unary term) and four formulations of Epariwise (prior) are investigated.

Unary term
A widely used formulation for the unary term is the log-inverse formulation using the natural logarithm.It corresponds to the information content in information theory and is formulated as follow: It highly penalizes the low-probability classes but increase the complexity with potential infinite values.An other simple formulation for the unary term is the linear formulation, It penalizes less than the log-inverse formulation but has the advantage of having values lying in [0, 1].

Prior
In this work, the prior has a value depending on the class of neighboring pixels.In the four formulations, two neighboring pixels pay no penalty if they are assigned to the same class.Two basic and popular priors, the Potts model and the contrast-sensitive Potts model (called here z-Potts model), are investigated.In the Potts model, two neighboring pixels pay the same penalty if they are assigned to different labels, the prior for the Potts model is: In the z-Potts model, the penalty for a change of label depends on the gradient of height between two neighboring pixels.The z-Potts model is a standard contrast-sensitive Potts model applied to the height obtained from the point clouds.Here, since we are dealing with forest stands that are likely to exhibit distinct heights, the gradient of the height map (given with the 3D lidar point cloud) is computed for each of the four directions separately.The maximum Mg over the whole image in the four directions is used to compute the final pairwise energy.A linear function has been used: the penalty is highest when the gradient is 0, and decreases until the gradient reaches its maximum value.
The prior of the z-Potts model is therefore: where gu→v is the gradient between pixel u and pixel v, i.e., the absolute value of the height difference of the two pixels.An other pairwise energy investigated is a global feature sensitive energy (called here Exponential-features model).The pairwise energy is computed with respect to a pool of n features.When the features have close values, the penalty is high and decreases when the features tends to be very different.The pairwise energy in this case is expressed as follows: (11) where Ai(u) is the value of the i th feature of the pixel u.To compute such energy, the features need to be first normalized (i.e., zero mean, unit standard deviation) in order ensure that they all have the same dynamic.The last formulation investigated is also a global feature sensitive energy (called here Distance-features model).The pairwise energy is still computed with respect to a pool of n features.In this case, the energy is computed according to the distance between the two neighboring pixels in the feature space, the penalty is high when the pixels are close in the feature space and decrease when they get distant.The pairwise energy in this case is expressed as follow: with To compute such energy, the features need to be first normalized (i.e., zero mean, unit standard deviation) in order ensure that they all have the same dynamic.They are then rescaled between 0 and 1 to ensure that ||A(u); A(v)||n,2 lies in [0; 1] ∀(u, v).
In (Dechesne et al., 2017), a high number of features was extracted from available lidar and optical images (∼ 100) but can be selected.They can also be weighted according to their impor-tance, computed through the Random Forest classification process.Since the most important features ( 20) are almost all equally weighted, it does not bring additional discriminative information for the global feature sensitive energy.

Energy minimization
The energy minimization is performed using graph-cut methods.The graph-cut algorithm employed is the quadratic pseudo-boolean optimization (QPBO).
The QPBO is a popular and efficient graph-cut method as it efficiently solves energy minimization problems (such as the proposed ones) by constructing a graph and computing the min-cut (Kolmogorov and Rother, 2007).α-expansion moves are used, as they are an efficient way to deal with the multi-class problems (Kolmogorov and Zabih, 2004).

Data
The different smoothing methods have been conducted on 4 mountainous test areas.Each area has a surface of 1 km 2 .The IRC ortho-images of the test areas are presented in Figure 1.The proposed areas exhibit a large range of species (>4).The airborne multispectral images were captured by the IGN digital cameras (Souchon et al., 2012).They have 4 bands: 430-550 nm (blue), 490-610 nm (green), 600-720 nm (red) and 750-950 nm (near infra-red) at 0.5 m ground sample distance (spatial resolution).
The airborne lidar data were collected using an Optech 3100EA device.The footprint was 0.8 m in order to increase the probability to reach the ground.The point density for all echoes ranges from 2 to 4 points/m 2 .Our multispectral and lidar data fit with the standards used in many countries for large-scale operational forest mapping purposes (White et al., 2016).The multispectral images and the lidar data were acquired simultaneously.

Results
The results for all methods are presented for Area 2 in Table 1.
The overall accuracy is computed by comparing the labelled pixels in the forest LC, to the pixels of the output images.The filtering method performs the worse with a gain of less than 1% compared to the classification, even with large filters.Furthermore, the larger the filter is, the longer are the computation times.
The probabilistic relaxation has also poor results (+5% than the classification) and has also important computation times, since the iterative process runs until the convergence has been reached.Table 1.Results for the proposed methods for Area 2. The classification has an overall accuracy of 81.41%.
The best results for the four areas are presented in Figure 2. It shows that the proposed formulation is very efficient to retrieve forest patches of pure species with smooth borders.The accuracy ranges from 96% to 99%.Furthermore, the borders between adjacent classes fit well with the ones from the forest LC borders, which validates the relevance of our model.However, in areas where no data is available, it is hard to ensure that our model has relevant results, but, from a visual point of view, the results seem good.
The results for all the proposed models using the log-inverse unary are presented for Area 1 in Figure 3.It appears clearly that basic methods (such as filtering or probabilistic relaxation) are not adapted to our problem since the results remain very noisy.However, having a too binding unary term in the model also leads to noisy results.Even if the accuracy is higher than the accuracy using the linear unary, the small patches produced with the loginverse unary are not acceptable for a forest LC.
The effect of the parameter γ is presented in

CONCLUSION
The semantic segmentation of forest stands can be achieved by the fusion of ALS data and multispectral images.These two remote sensing modalities produce very satisfactory results since they both provide complementary observations.Good discrimination scores are already obtained with standard features and    classifier, which is a strong basis for an even more accurate delineation.This delineation can then achieved using several smoothing methods.It appears that a too simple smoothing model (such as filtering or probabilistic relaxation) is not sufficient in order to obtain consistent segments.A global smoothing method based on an energy model tends to be well adapted to the problem.A simple Potts model with a linear unary term provides excellent results.The model can be improved using the features used for the classification.Such model produces slightly better results, but also increases the complexity.However, having a too complex model (such as Exponential-features model with log-inverse unary) decreases the performance of the segmentation.In order to obtain homogeneous areas in terms of species with smooth borders, a global method based on a simple energy model is sufficient.

Figure 1 .
Figure 1.Orthoimages of the areas of interest.
The global smoothing methods have great results, increasing the accuracy up to 15%.The z-Potts model tends to have slightly worse results than the other methods.The Potts model and the Distance-features model have very close results regardless of the unary term.The Exponential-features model have the greatest results with the linear unary, but have poor results with the loginverse unary.It appears that it is the only energy that is sensitive to the unary term, indeed, for the Potts model, the z-Potts model and the Distance-features model, the difference between the linear unary and the log-inverse unary is less than 0.2%.

Figure 4 .
When γ is low, the borders are rough and small regions might appear (Figure 4(a)).Increasing γ smooth the borders, however, a too high value have a negative impact on the results, reducing the size of meaningful segments (Figure 4(c)) or even removing them (Figure 4(d)).The tuning of the parameter γ is an important issue, since different values of γ might be acceptable depending on the level of detail expected for the segmentation.In forest inventory, having small regions of pure species is interesting for the understanding of the behavior of the forest.For generalization purposes (such as forest LC), the segments must have a decent size and may exhibit variability.(a)γ = 5 (95.6%).(b) γ = 10 (96.36%).(c) γ = 15 (95.27%).(d) γ = 20 (94.09%).

Figure 4 .
Figure 4. Results of the Exponential-features model with linear unary for different values of γ for Area 2, the overall accuracy is specified in brackets.Color codes: deciduous oaks, fir or spruce, chestnut, robinia.

Figure 2 .
Figure2.Results for all the 4 areas, the overall accuracy is specified in brackets.The smoothing is performed using the Exponential-features model with linear unary (γ = 10).Color codes: beech, deciduous oaks, Scots pine, Douglas fir, fir or spruce, chestnut, robinia, larch, non-pectinated fir, black pine, herbaceous formation, no data.

Figure 3 .
Figure 3. Results of the different proposed models for Area 1, the overall accuracy is specified in brackets.Color codes: beech, deciduous oaks, Scots pine, Douglas fir, fir or spruce.