LOW-RANK MATRIX DECOMPOSITION WITH SUPERPIXEL-BASED STRUCTURED SPARSE REGULARIZATION FOR MOVING OBJECT DETECTION IN SATELLITE VIDEOS

With satellite videos becoming newly accessible, advanced video processing and machine learning techniques make it possible to retrieve the dynamic information of moving objects over a vast territory. Detecting moving objects can be based on the structures of both the background and the foreground of a satellite video, where the background is assumed to lie in a low-dimensional subspace. As the moving objects in satellite videos are groups of neighbouring pixels rather than isolated pixels, Low-rank and Structured Sparse Decomposition (LSD) with structured sparsity regularization on the foreground can suppress the false alarms caused by isolated outliers. However, in LSD, the groups of neighbouring pixels are extracted by a fixed sliding window over each video frame, which ignores the coherence in the appearance of a moving object; for example, a moving object can have an irregular shape and an arbitrary orientation. In this paper, we argue that the spatial groups on the foreground can be defined using the concept of superpixels, where each superpixel is formed by a group of spatially connected similar pixels obtained from over-segmentation. We conduct low-rank matrix decomposition at the superpixel level, which we name Superpixel-based LSD (S-LSD). To handle the variation in moving objects, we combine superpixels at a range of scales in the superpixel-based spatial regularization on the foreground. With the reduction in the number of spatial groups, S-LSD has reduced computational complexity. The results on two satellite videos show satisfactory performance with a significant saving in processing time when the proposed S-LSD approach is applied.


INTRODUCTION
Recently, the cube satellites Jilin-1 (Luo et al., 2017) and SkySat (Team, 2016) have made it possible to produce satellite videos over a large territory. Unlike previous still images with low revisiting frequency, a satellite video is a sequence of 2-D spatial frames captured by the satellite at a high frame rate. The abundant temporal information in these videos is helpful for retrieving motion information on objects of interest over a larger territory, which facilitates a wide range of applications including target tracking (Mou, Zhu; Du et al., 2018; Zhang et al., 2018; Uzkent et al., 2018) and traffic monitoring (Kopsiaftis, Karantzalos). Detecting moving objects from satellite videos plays a vital role in these applications. Contemporary object detectors achieve state-of-the-art detection performance by learning an image-based detector from manually annotated training images (Long et al., 2017; Li et al., 2017; Ding et al., 2018; Liu et al., 2018). However, in satellite videos, the applicability of these approaches is limited by the lack of sufficient annotations for training such over-parameterized models. Alternatively, unsupervised methods for Moving Object Detection (MOD) can separate moving objects from the background scene by making use of the temporal information.
The canonical approaches for MOD assume that each frame in a video is composed of a foreground and a background. The background part of a frame is considered temporally stable and similar, while the temporally changing foreground part contains the moving objects. Based on this assumption, the dominant set of MOD approaches relies on low-rank matrix decomposition, where the background data lie in a low-dimensional subspace and the moving objects in the foreground are considered sparse outliers (Bouwmans et al., 2017, 2018). Robust Principal Component Analysis (RPCA), as a fundamental method in this set, imposes a pixel-wise sparsity regularization term on the foreground in the low-rank matrix decomposition problem (Candès et al., 2011), whose solution can be obtained by Principal Component Pursuit (RPCA-PCP) (Lin et al., 2011; Candès et al., 2011; Wright et al., 2009) and Fast Low Rank Approximation (GoDec) (Zhou, Tao). However, RPCA is prone to false alarms caused by isolated outliers in satellite videos.
To suppress these false alarms, spatial regularization terms are imposed on the foreground in low-rank matrix decomposition. Total Variation (TV) regularization is deployed to enforce smoothness on the foreground in the matrix decomposition (Xu et al., 2017). The first-order Markov Random Field (MRF) is also integrated into low-rank matrix decomposition to constrain the moving objects to be contiguous (Zhou et al., 2013; Shakeri, Zhang). In satellite videos, where spatial resolution is low and color information is limited, these approaches yield limited improvement in MOD performance, as they risk merging neighbouring targets.
Another kind of spatial prior on the foreground is defined on the sparsity over groups of spatially neighboring pixels rather than independent pixels. The structured sparsity-inducing norm (Jenatton et al., 2011) is then introduced to regularize the foreground (Liu et al., 2015; Xu et al., 2013; Zhang et al., 2019a). For satellite videos, an Extended Low-rank and Structured Sparse Matrix Decomposition (E-LSD) model has been proposed for boosting the MOD performance by imposing structured sparse regularization on the foreground (Zhang et al., 2019a,b). However, the spatial groups of neighbouring pixels in these approaches are extracted by a fixed sliding window over each video frame, which ignores the irregular shapes and arbitrary orientations of moving objects. Another disadvantage of the sliding window approach is that many spatial windows are processed unnecessarily, which leads to increased processing time.
In this paper, we argue that the spatial groups in the spatial regularization can be constructed from superpixels, where each superpixel is formed by a group of spatially connected similar pixels obtained from over-segmentation. In satellite videos, it is reasonable to assume that a moving object is commonly composed of one or more coherent regions, each of which can be extracted by over-segmentation. Inspired by this observation, we propose to conduct low-rank and structured sparse matrix decomposition with spatial groups defined by superpixels, which we name Superpixel-based Low-rank and Structured Sparse Matrix Decomposition (S-LSD). To handle moving objects of various sizes, we also combine spatial groups from multiple sets of superpixels at a range of scales. In S-LSD, the number of spatial groups is smaller than that in LSD or E-LSD, which helps reduce the computational complexity of S-LSD in practice. We compare the proposed S-LSD with state-of-the-art algorithms on two satellite videos, and the experimental results validate the significantly reduced processing time of S-LSD with satisfactory MOD performance.
The remainder of this paper is organized as follows. The proposed S-LSD is presented in Section 2. The experimental results and performance comparison against state-of-the-art approaches are presented in Section 3. Finally, conclusions are given in Section 4.

Problem Formulation
The proposed Superpixel-based LSD (S-LSD) is defined as a low-rank matrix decomposition problem, where the rank of the background is minimized. In order to suppress false alarms caused by isolated outliers in the foreground, S-LSD imposes a superpixel-based structured sparse regularization on the foreground.
Given a sequence of n video frames, each containing p pixels, S-LSD decomposes the corresponding matrix D ∈ R^{p×n} into a low-rank background matrix B ∈ R^{p×n} and a structured sparse foreground matrix S ∈ R^{p×n}. The optimization problem of S-LSD is formulated as
$$\min_{B,S,E} \; \mathrm{Rank}(B) + \lambda_1 \Omega(S) + \lambda_2 \|E\|_F^2 \quad \text{s.t.} \quad D = B + S + E, \tag{1}$$
where Ω(S) refers to the spatial regularization term on the foreground S, and E ∈ R^{p×n} is introduced to handle the noise in the model. λ1 and λ2 are the weights assigned to the spatial regularization term and the noise term, respectively.
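As a minimal sketch of how the data matrix D can be assembled, the snippet below stacks each frame as one column, so that p is the number of pixels per frame and n the number of frames. The helper names `frames_to_matrix` and `matrix_to_frames`, the random test video and its size are illustrative assumptions, not part of the paper.

```python
import numpy as np

def frames_to_matrix(frames):
    """Stack n frames of shape (h, w) into a p x n matrix with p = h * w."""
    frames = np.asarray(frames, dtype=np.float64)
    n = frames.shape[0]
    # Each frame is flattened row by row and becomes one column of D.
    return frames.reshape(n, -1).T

def matrix_to_frames(D, shape):
    """Inverse mapping: turn a p x n matrix back into n frames of `shape`."""
    h, w = shape
    return D.T.reshape(-1, h, w)

# Hypothetical toy video: n = 5 frames of 4 x 6 pixels.
rng = np.random.default_rng(0)
video = rng.random((5, 4, 6))
D = frames_to_matrix(video)
```

With this layout, column i of D is frame i, which is what makes a frame-wise treatment of the foreground straightforward.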
In this paper, we assume the moving objects are sparse groups of neighboring non-zero pixels in the foreground, and the structured sparsity-inducing norm (Jenatton et al., 2010, 2011; Jia et al., 2012) is adopted to regularize each foreground frame s ∈ R^p as
$$\Omega(s) = \sum_{g \in G(s)} \eta_g \, \|s_{|g}\|_\infty, \tag{2}$$
where G(s) refers to the set of spatial groups of neighboring pixels of the foreground s, and s_{|g} ∈ R^p is a sparse vector whose non-zero elements are those of s at the indices in a group g ∈ G(s).
For each group of pixels g ∈ G(s), ηg is the weight of the group. Applying the structured sparsity-inducing norm to the foreground as the spatial regularization term tends to set all pixels in a group to zero together, so isolated outliers in the foreground are suppressed. In S-LSD, no temporal relationship on the foreground is defined in Equation (1), so the structured sparse penalty is frame-wise independent.
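The effect of the group penalty can be illustrated with a small numerical sketch. It assumes the ℓ∞ variant of the structured sparsity-inducing norm, which is the variant consistent with the ℓ1-constrained dual problem in Equation (12); the function name `structured_sparse_norm` and the toy vectors are hypothetical.

```python
import numpy as np

def structured_sparse_norm(s, groups, weights):
    """Omega(s) = sum over groups g of eta_g * max_{j in g} |s_j|."""
    s = np.asarray(s, dtype=np.float64)
    return sum(eta * np.max(np.abs(s[g])) for g, eta in zip(groups, weights))

# Three disjoint groups of two pixels each, all with weight 1.0.
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
weights = [1.0, 1.0, 1.0]

# Two non-zero pixels inside one group versus spread over two groups.
clustered = np.array([0.0, 0.0, 0.9, 0.9, 0.0, 0.0])
scattered = np.array([0.9, 0.0, 0.9, 0.0, 0.0, 0.0])

omega_clustered = structured_sparse_norm(clustered, groups, weights)  # 0.9
omega_scattered = structured_sparse_norm(scattered, groups, weights)  # 1.8
```

Both vectors have the same number of non-zero pixels, but the scattered one activates two groups and is penalized twice as much, which is why isolated outliers are discouraged.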

Superpixel-based Structured Sparse Regularization
In the spatial regularization term Ω(S), the groups of neighboring pixels G(s) are commonly constructed from the patches extracted by a fixed sliding window over each frame (Liu et al., 2015; Xu et al., 2013; Zhang et al., 2019a,b), which increases the number of generated spatial groups and hurts the efficiency of solving Equation (1). In this paper, we propose to build the spatial groups G(s) from the superpixels extracted by over-segmentation. With the superpixels extracted from a frame s at a given scale, a spatial group is constructed from the pixels in a superpixel. In order to handle the variation in moving objects, G(s) combines the superpixels extracted at a range of scales.
Let M = {m_1, m_2, ..., m_|M|} denote a selected set of superpixel scales, and let G_m(s) denote the groups constructed from the superpixels extracted at a given scale m ∈ M. Given a set of scales M, we define the entire set of spatial groups as
$$G(s) = \bigcup_{m \in M} G_m(s). \tag{3}$$
With the spatial groups defined above, the proposed Superpixel-based Structured Sparse Regularization on a foreground frame s is defined as
$$\Omega(s) = \sum_{m \in M} \sum_{g \in G_m(s)} \eta_g \, \|s_{|g}\|_\infty, \tag{4}$$
where ηg is the weight for a group of neighboring pixels. In this paper, we assign different weights to superpixels at different scales. For spatial groups constructed from superpixels at coarse scales, small objects in the foreground would be suppressed, as the pixels in such groups may be forced to zero. Based on this understanding, the weights for spatial groups at larger scales are decreased as a function of |G_m(s)| and η0, where |G_m(s)| is the number of spatial groups in G_m(s), and η0 is the initial weight for the spatial regularization term. Since we always have |G_{m_1}| > |G_{m_2}| > · · · > |G_{m_|M|}|, the weights of spatial groups at large scales are always less than η0. The same weight is assigned to the spatial groups constructed from superpixels of the same scale. For simplicity, we assign 1.0 to the initial weight η0. In this paper, the superpixels are extracted by the SEEDS approach (Van den Bergh et al., 2015), where the initial size of a superpixel in SEEDS corresponds to the scale m in the Superpixel-based Structured Sparse Regularization.
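The construction of multi-scale spatial groups can be sketched as follows. Two parts of this sketch are assumptions rather than the method of the paper: a regular grid of square cells stands in for the SEEDS over-segmentation (any label map, for instance from a SEEDS implementation, could be passed to `groups_from_labels` instead), and the weighting rule η0 · |G_m| / |G_{m_1}| is a hypothetical choice that merely satisfies the stated property that coarser scales receive weights below η0. All function names are illustrative.

```python
import numpy as np

def grid_labels(h, w, scale):
    """Stand-in over-segmentation (assumption): square cells of side `scale`."""
    rows = np.arange(h) // scale
    cols = np.arange(w) // scale
    n_cols = -(-w // scale)  # ceiling division: number of cells per row
    return rows[:, None] * n_cols + cols[None, :]

def groups_from_labels(labels):
    """Return one array of flat pixel indices per superpixel label."""
    flat = labels.ravel()
    order = np.argsort(flat, kind="stable")
    _, starts = np.unique(flat[order], return_index=True)
    return np.split(order, starts[1:])

def multiscale_groups(h, w, scales, eta0=1.0):
    """Union of the groups over all scales, with one weight per scale."""
    per_scale = [groups_from_labels(grid_labels(h, w, m)) for m in sorted(scales)]
    n_finest = len(per_scale[0])
    groups, weights = [], []
    for scale_groups in per_scale:
        # Hypothetical weighting rule: fewer groups (coarser scale) -> smaller weight.
        eta = eta0 * len(scale_groups) / n_finest
        groups.extend(scale_groups)
        weights.extend([eta] * len(scale_groups))
    return groups, weights

# Toy frame of 16 x 16 pixels with scales 4 and 8.
groups, weights = multiscale_groups(16, 16, scales=[4, 8])
```

In this toy setting the finest scale contributes 16 groups and the coarser scale 4 groups, so every pixel belongs to exactly two overlapping groups, one per scale.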

Solution to S-LSD
To make the problem in Equation (1) more tractable, the nuclear norm ‖B‖_*, which is the convex relaxation of Rank(B), is utilized to replace the rank minimization, and the linear constraint is removed by the Augmented Lagrangian method. Then the problem becomes the minimization of
$$\mathcal{L}(B,S,E,Y) = \|B\|_* + \lambda_1 \Omega(S) + \lambda_2 \|E\|_F^2 + \langle Y, D - B - S - E \rangle + \frac{\mu}{2} \|D - B - S - E\|_F^2, \tag{6}$$
where Y ∈ R^{p×n} is the Lagrange multiplier and µ > 0 is the penalty parameter. An alternating minimization scheme is applied to Equation (6), since a sufficient guarantee on convergence has been provided for the set of optimization problems that minimize the sum of three functions with uncoupled variables under a three-block linear constraint (Cai et al., 2017; Zhang et al., 2019a).
As summarized in Algorithm 1, the procedure for solving Equation (6) is to alternately solve the sub-problems with respect to B, S and E, with the two remaining variables fixed, until convergence.
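2.3.1 Update B. At each iteration, B^{k+1} is updated by the Singular Value Thresholding approach (Wright et al., 2009; Cai et al., 2010),
$$B^{k+1} = U \, \mathcal{S}_{1/\mu}(\Sigma) \, V^T, \quad \text{with} \quad U \Sigma V^T = \mathrm{SVD}\left(D - S^k - E^k + Y^k/\mu\right),$$
where Y^k is the Lagrange multiplier and µ the penalty parameter of the Augmented Lagrangian, and S_{1/µ}(Σ) conducts the element-wise soft-shrinkage on the diagonal matrix Σ by
$$\mathcal{S}_{\tau}(x) = \mathrm{sign}(x) \, \max(|x| - \tau, 0).$$
ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume V-2-2020, 2020 XXIV ISPRS Congress (2020 edition)
The sub-problem with respect to S in Equation (8) is decomposed into a set of frame-wise optimization problems, since the spatial regularization term on the foreground is frame-wise independent.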
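The Singular Value Thresholding step used for the B update can be sketched in a few lines of NumPy. This is a generic illustration of the operator, not the authors' implementation; the threshold `tau` plays the role of 1/µ, and the test matrix is synthetic.

```python
import numpy as np

def soft_shrink(x, tau):
    """Element-wise soft-shrinkage: sign(x) * max(|x| - tau, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def singular_value_threshold(M, tau):
    """Shrink the singular values of M by tau and rebuild the matrix."""
    U, sigma, Vt = np.linalg.svd(M, full_matrices=False)
    # Scaling the columns of U by the shrunk singular values avoids
    # building the diagonal matrix explicitly.
    return (U * soft_shrink(sigma, tau)) @ Vt

# Synthetic example: a rank-2 matrix plus small dense noise.
rng = np.random.default_rng(0)
low_rank = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 8))
noisy = low_rank + 0.01 * rng.standard_normal((30, 8))
B = singular_value_threshold(noisy, tau=0.5)
```

Singular values below the threshold are set exactly to zero, so the small singular values introduced by the noise are removed and the output has low rank.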
Given a frame d = D_i, ∀i ∈ {1, · · · , n}, the decomposed optimization problem for the foreground frame s = S_i^{k+1} is rewritten as
$$\arg\min_{s} \; \frac{1}{2} \|h - s\|_2^2 + \lambda \, \Omega(s), \tag{11}$$
where h = d − B_i^{k+1} − E_i^k + Y_i^k/µ, with Y^k the Lagrange multiplier, and λ = λ1/µ. When the spatial groups of pixels in G are non-overlapping, the problem in Equation (11) can be solved by the Group-LASSO method (Yuan, Lin). However, as G combines the spatial groups at different scales, overlapping groups of variables arise, and Equation (11) cannot be solved directly. Instead, the solution is obtained from its dual problem, a Quadratic Min-cost Network Flow problem,
$$\min_{\xi} \; \frac{1}{2} \Big\| h - \sum_{g \in G} \xi^g \Big\|_2^2 \quad \text{s.t.} \quad \forall g \in G, \; \|\xi^g\|_1 \le \lambda \eta_g \;\; \text{and} \;\; \xi^g_j = 0 \;\; \text{if} \;\; j \notin g, \tag{12}$$
where ξ ∈ R^{p×|G|} is the dual variable. This Quadratic Min-cost Network Flow problem is defined and solved in (Mairal et al., 2010). After solving the dual problem, the foreground S_i = s is obtained by
$$s = h - \sum_{g \in G} \xi^{*g},$$
in which ξ* refers to the optimal solution to Equation (12).
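For the simpler case of non-overlapping groups, the frame-wise problem can be solved in closed form group by group, which also makes the primal-dual relation concrete. The sketch below assumes the ℓ∞ group norm and disjoint groups: the dual variable of each group is the Euclidean projection of the input onto an ℓ1 ball of radius λ·ηg, and the foreground is the input minus that projection. It does not cover the overlapping multi-scale case, which requires the network flow solver of Mairal et al. (2010). Function names and the toy input are hypothetical.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto {x : ||x||_1 <= radius} (sort-based)."""
    a = np.abs(v)
    if a.sum() <= radius:
        return v.copy()
    u = np.sort(a)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * k > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(a - theta, 0.0)

def prox_group_linf(h, groups, weights, lam):
    """Solve min_s 0.5 * ||h - s||^2 + lam * sum_g eta_g * ||s_g||_inf
    for disjoint groups, one group at a time."""
    h = np.asarray(h, dtype=np.float64)
    s = h.copy()
    for g, eta in zip(groups, weights):
        xi = project_l1_ball(h[g], lam * eta)  # dual variable restricted to g
        s[g] = h[g] - xi                       # primal solution on group g
    return s

# Toy input: weak isolated responses in groups 0 and 2, a strong object in group 1.
h = np.array([0.1, -0.05, 2.0, 1.5, 0.0, 0.3])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
s = prox_group_linf(h, groups, weights=[1.0, 1.0, 1.0], lam=0.4)
```

Groups whose total absolute response is below λ·ηg are zeroed as a whole, while the strong group survives with only its largest entry reduced, giving s = [0, 0, 1.6, 1.5, 0, 0] here.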

Computation Complexity
For processing a video of n frames, the computational complexity of solving Equation (1) is related to the number of spatial groups in G, O(n(p^2 + Σ_{g∈G} |g|)). Compared with constructing G by a sliding window, the superpixel-based spatial regularization usually has a reduced number of spatial groups. When processing time is critical, S-LSD with spatial groups constructed at a single proper scale may achieve both reduced processing time and moderate MOD performance. Combining spatial groups of multiple scales in S-LSD may improve the MOD performance by handling moving objects of different sizes, which, in turn, may increase the processing time. In practice, by reducing the number of spatial groups, S-LSD can help reduce the time consumption for processing large satellite videos with satisfactory performance.
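The group-dependent term Σ_{g∈G} |g| in the complexity can be compared for the two constructions with a short count. The frame size, the 3 × 3 window with stride 1 and the chosen scales are illustrative assumptions; the only property used for superpixels is that each scale partitions the frame, so each scale contributes exactly p to the sum.

```python
def sliding_window_group_size_sum(h, w, win, stride=1):
    """Number of windows and the sum of their sizes for a win x win window."""
    n_rows = (h - win) // stride + 1
    n_cols = (w - win) // stride + 1
    n_groups = n_rows * n_cols
    return n_groups, n_groups * win * win

def partition_group_size_sum(h, w, scales):
    """Each scale partitions the frame, so it adds h * w to the sum of |g|."""
    return len(scales) * h * w

# Hypothetical 100 x 100 frame.
h, w = 100, 100
n_win, size_win = sliding_window_group_size_sum(h, w, win=3)
size_sp = partition_group_size_sum(h, w, scales=[4, 16])
```

Under these assumptions the sliding window yields 9604 groups with a size sum of 86436, while two superpixel scales give a size sum of 20000, independent of how many superpixels each scale contains.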

Dataset
The detection performance of S-LSD was evaluated on two satellite videos. They were captured over Las Vegas, USA on March 25, 2014, with a spatial resolution of 1.0 meter and a frame rate of 30 frames per second. Both videos contain 700 frames with bounding boxes for moving vehicles as groundtruth, and details on both videos are listed in Table 1.
The MOD performance on these videos is evaluated on recall, precision and F1 scores given by
$$\mathrm{recall} = \frac{TP}{TP + FN}, \quad \mathrm{precision} = \frac{TP}{TP + FP}, \quad F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},$$
where TP denotes the number of correct detections, and FN and FP are the numbers of missed detections and false alarms, respectively. In this paper, we define a correct detection as one whose maximum Intersection over Union (IoU) against the groundtruth is greater than a threshold. To accommodate the small size of vehicles in satellite videos, the threshold is set to 0.3.
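The evaluation protocol can be sketched as follows. The greedy one-to-one matching between detections and groundtruth boxes is an assumption made for this illustration (the text only specifies the IoU threshold), and the box format (x1, y1, x2, y2) as well as the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def evaluate(detections, ground_truth, threshold=0.3):
    """Return (precision, recall, F1) with greedy one-to-one matching (assumption)."""
    matched, tp = set(), 0
    for det in detections:
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(ground_truth):
            if j in matched:
                continue
            value = iou(det, gt)
            if value > best_iou:
                best_iou, best_j = value, j
        if best_j is not None and best_iou > threshold:
            matched.add(best_j)
            tp += 1
    fp = len(detections) - tp      # false alarms
    fn = len(ground_truth) - tp    # missed detections
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: two vehicles, three detections (one of them a false alarm).
ground_truth = [(0, 0, 4, 4), (10, 10, 14, 14)]
detections = [(1, 1, 5, 5), (20, 20, 24, 24), (10, 10, 14, 14)]
precision, recall, f1 = evaluate(detections, ground_truth)
```

In the toy example the shifted first detection has an IoU of 9/23 ≈ 0.39 and still counts as correct under the 0.3 threshold, giving a precision of 2/3, a recall of 1.0 and an F1 score of 0.8.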

Selecting Proper Scales for Spatial Regularization
The scale of the spatial groups in the superpixel-based spatial regularization term plays an important role in S-LSD. We first evaluate the performance of S-LSD with single-scale spatial regularization. When a small scale is selected, a considerable number of spatial groups is constructed in G, which increases the processing time. However, when a larger scale is selected, small moving objects in the foreground may be suppressed, as the pixels in a spatial group tend to be zero together. As presented in Figure 2, when the scale of the spatial groups increases from 4 to 64, the recall rate of the moving objects drops from 90.3% to 71.7%. At the same time, as fewer spatial groups are constructed in G for larger scales, the processing time is reduced. In this paper, we combine spatial groups of different scales to handle moving objects of different sizes. As shown in Table 2, when two scales of spatial groups are combined in the regularization term, M = {4, 16}, the MOD performance is improved. When more scales are introduced, M = {4, 16, 64}, the MOD performance drops slightly, as some small moving objects may be improperly suppressed by the large spatial groups at the scale of 64. A moderate number of scales is therefore recommended.
In the following experiments, we use M = {4} for S-LSD with single-scale spatial regularization and M = {4, 16} for S-LSD with multiple-scale spatial regularization. The weights λ1 and λ2 are selected by cross validation, and further fine-tuning of M, λ1 and λ2 may further improve the MOD performance of S-LSD.

Comparison with Other Methods
To verify the effectiveness of S-LSD, we compare its detection performance against three state-of-the-art approaches: RPCA (Candès et al., 2011), LSD (Liu et al., 2015) and E-LSD (Zhang et al., 2019a). RPCA is a low-rank matrix decomposition method without spatial constraints on the foreground, and is solved by Principal Component Pursuit. LSD and E-LSD both impose structured sparse regularization on the foreground, where the spatial groups are constructed by a sliding window.
As presented in Table 3 and Figure 4, S-LSD with the single-scale spatial regularization M = {4} achieves comparable performance in terms of recall with significantly reduced processing time. Compared with RPCA, the superpixel-based structured sparse regularization in S-LSD helps reduce the false alarms due to noise in the data, which leads to improved detection precision. Compared with LSD and E-LSD, where the structured sparse regularization is based on a sliding window, S-LSD reduces the processing time with comparable MOD performance. The reduction in time consumption shows that S-LSD is more applicable where processing time is critical. S-LSD improves the detection precision to the extent that the remaining false alarms are caused by other factors. As presented in Figure 4, moving objects are mistakenly recognized on the tops of buildings in the bottom-left part of the video. These false alarms are likely owing to the motion of the satellite while capturing the videos, and suppressing them is beyond the scope of this paper.
When multiple-scale spatial regularization is imposed, S-LSD improves the detection precision with a moderate increase in time consumption, as shown in Table 3. Compared with LSD and E-LSD, S-LSD (M = {4, 16}) achieves the highest precision on both videos with lower time costs. As Video 002 contains fewer large moving vehicles, applying spatial regularization at a larger scale may suppress small moving vehicles, which leads to the small drop in the recall rate of S-LSD. Different scales are selected for different videos because the performance of S-LSD is related to the over-segmentation performance, which is affected by the complexity of the video as well as the object size.

CONCLUSION
In this paper, we propose a Superpixel-based Low-rank and Structured Sparse Decomposition (S-LSD) algorithm for moving object detection, where superpixel-based structured sparse regularization is imposed on the foreground. We show that S-LSD with single-scale spatial regularization greatly reduces the time consumption with moderate detection performance, which makes it more applicable for processing large satellite videos. S-LSD with multiple-scale spatial regularization offers good detection performance, which makes it more suitable for applications with high precision requirements. With improved over-segmentation approaches for satellite videos, the MOD performance of S-LSD could be further improved.