VEHICLE DETECTION IN REMOTE SENSING IMAGES USING DEEP NEURAL NETWORKS AND MULTI-TASK LEARNING

Vehicle detection in remote sensing images has attracted remarkable attention in recent years for its applications in traffic, security, military, and surveillance fields. Motivated by the stunning success of deep learning techniques in the object detection community, we utilize CNNs for the vehicle detection task in remote sensing images. Specifically, we take advantage of a deep residual network, multi-scale feature fusion, hard example mining, and homography augmentation, integrating most of the advanced techniques in the deep learning community. Furthermore, we simultaneously address the super-resolution (SR) and detection problems of low-resolution (LR) images in an end-to-end manner. In consideration of the absence of paired low-/high-resolution data, which are generally time-consuming and cumbersome to collect, we leverage a generative adversarial network (GAN) for unsupervised SR. The detection loss is back-propagated to the SR generator to boost detection performance. We conduct experiments on representative benchmark datasets and demonstrate that our model yields significant improvements over state-of-the-art methods in the deep learning and remote sensing areas.


INTRODUCTION
Vehicle detection in remote sensing images has been widely applied in many fields and has thus received much attention in recent years. In spite of the tremendous efforts devoted to this task, existing methods still require substantial improvement to address several challenges. First, scale and direction variability make it difficult to accurately locate vehicle objects. Second, complex backgrounds increase intraclass variability and interclass similarity. Third, some remote sensing images are captured at low resolution, which results in a lack of sufficient appearance detail to distinguish vehicles from similar objects. As shown in Figure 1, compared with everyday images of vehicles, a remote sensing image captured from a perpendicular (or slightly oblique) viewpoint loses the 'face' of the vehicle, and vehicles typically display rectilinear structures. Thus, the presence of non-vehicle rectilinear objects, such as trash bins, electrical units, and air conditioning units on the tops of buildings, can complicate the task and cause many false alarms. Therefore, researchers are trying to exploit state-of-the-art deep learning based object detection techniques to push the boundaries of achievement in this regard.
Object detection can be split into two sub-tasks, localization and classification. Conventional methods usually address this problem in three phases: image segmentation, feature extraction, and classifier training. In particular, saliency detection is utilized to generate regions of interest (RoIs) as positive samples. Then, low-level, handcrafted visual features (e.g., color histogram, texture, local patterns) are constructed on these samples to train classifiers (e.g., SVM, AdaBoost). However, due to complicated texture structure and the lack of pixel-level annotation, the positive training data can be noisy and thus degrade the subsequent classifier. Furthermore, predefined manual features are usually computationally expensive and cannot access high-level semantic representations of objects, leaving the detection performance with much room for improvement.
Recently, convolutional neural networks (CNNs) have exhibited strong feature learning capability and obtained state-of-the-art performance in a variety of classification and recognition tasks on benchmark datasets. Specific to the problem of object detection, great achievements have been made, usually driven by the success of region proposal methods and region-based CNNs, such as R-CNN (Girshick et al., 2014) and Fast R-CNN (Girshick, 2015). On the basis of these networks, other advanced technologies are employed to boost detection performance. Feature pyramid network (FPN) (Lin et al., 2017) and the top-down module (TDM) (Shrivastava et al., 2016b) integrate multilayer features to cover objects at different scales. Deep residual networks are used as the backbone ConvNet for better representations. Moreover, sample mining techniques are applied to dig out the training data that contribute most to network optimization. In the remote sensing community, researchers have proposed many CNN-based methods for the vehicle detection task. But few works make full use of all the above advances in a unified framework, let alone any vehicle-oriented design that could be incorporated in CNNs to facilitate detection in remote sensing images.

Figure 2. Illustration of our CycleGAN-based vehicle detection module (CVDM). I_LR is the input LR image, I_SR is the super-resolved image from I_LR, and I'_LR is the LR image generated from I_SR. T_HR is the HR image provided as a reference from another high-quality dataset. T_LR is the down-sampled version of T_HR, and T_SR is the super-resolved image from T_LR. Colored arrows represent different parts of the whole framework.
Based on the above observations, we focus on the vehicle detection problem in remote sensing images. First, we take advantage of several advanced technologies in the DNN area to bridge the gap between the deep learning and remote sensing vehicle detection communities. In particular, we incorporate the deep residual network ResNet50 for feature extraction, a multiscale feature architecture for accurate predictions, and hard example mining to facilitate network optimization. In addition, we exploit a homography-based augmentation method to boost overall detection performance. Second, to alleviate the problem of low-quality image detection, which refers to LR images in this work, we leverage the CycleGAN model and multitask learning, in which the SR network is a generator and the object detector is treated as a discriminator. Note that, due to the lack of paired low-/high-resolution images, we investigate an unsupervised learning regime for the SR task. Our proposed framework is evaluated on several representative datasets, and the results demonstrate that ours outperforms the state-of-the-art object detection approach Faster R-CNN++ in the deep learning community, as well as other CNN-based methods in the remote sensing area.

RELATED WORK
In this section, we introduce several representative works in the object detection and vehicle detection fields.

General object detection
Early works in the object detection community mainly rely on handcrafted features (Dalal, Triggs, 2005), (Lowe et al., 1999) used to train classifiers (Felzenszwalb et al., 2009). Since AlexNet (Krizhevsky et al., 2012) won the ILSVRC-2012 competition (Deng et al., 2012), plenty of neural network architectures have been proposed and have shown powerful learning capability in the image classification task (Szegedy et al., 2015), (Szegedy et al., 2016). On the basis of these works, R-CNN (Girshick et al., 2014) takes warped candidate regions provided by region proposal methods (Uijlings et al., 2013) as input and extracts CNN features, which are utilized to train class-specific linear SVMs. To avoid the redundant computation in R-CNN, Fast R-CNN (Girshick, 2015) forwards the entire image through the network only once, forcing heavily overlapping proposals to share computation. Moreover, the CNN itself takes responsibility for classification and location regression, so the whole detection framework is modeled in an end-to-end manner. The region proposal network (RPN) is proposed to replace region proposal methods in Faster R-CNN (Ren et al., 2017), which significantly accelerates image processing. Later, FPN (Lin et al., 2017) and Mask R-CNN follow the Faster R-CNN pipeline, with improvements in multi-scale training, feature fusion, and multi-task learning. The aforementioned CNN-based detection methods are two-stage: class-agnostic proposals are provided and then refined in bounding box coordinates and classified into specific classes. Another typical solution for object detection is the one-stage method, in which predictions are made only once. YOLO (Redmon, Farhadi, 2018) and SSD (Liu et al., 2016a) are representatives of this trend.

Vehicle detection in remote sensing image
Traditional methods addressing the vehicle detection problem rely on shallow-learning features. Here we discuss some representative examples. Early work (Zhao, Nevatia, 2003) chooses the boundary of the car body, the shadow, and the boundary of the front windshield as characteristics to account for changes in view and shadow. The framework in (Liu et al., 2016b) applies Gaussian process (GP) classification and a gradient-based segmentation algorithm (GSEG) to estimate the vehicle probability of each pixel. The histogram of oriented gradients (HOG) feature descriptor (Dalal, Triggs, 2005) and a linear support vector machine (SVM) are used in (Bougharriou et al., 2017), (Madhogaria et al., 2015). Work (Kembhavi et al., 2010) uses color probability maps, pixel pairs, and HOG to describe color and geometric structure properties. In (Elmikaty, Stathaki, 2014), a gradient map is computed to filter out non-vehicle regions; multiple descriptors of the selected regions, HOG, Fourier, and truncated pyramid colour self-similarity (tPCSS), are combined to train an SVM.
Recently, CNNs have become the dominant approach for vehicle detection in the remote sensing field. A CNN-based detection model combining two independent convolutional neural networks was proposed in (Zhong et al., 2017). In (Uus, Krilavičius, 2019), a unified framework is proposed on the basis of YOLO to realize airplane detection in aerial images. Similarly, YOLO-like architectures are used to detect aerial vehicles (Carlet, Abayowa, 2017), (Lu et al., 2018). Moreover, many region-based methods have been developed to detect smaller aerial vehicles. In (Kyrkou et al., 2018), a sliding window incorporating prior knowledge is utilized to generate vehicle-like proposals; a CNN then completes the classification task. Work (Ji et al., 2019) investigates an improved Faster R-CNN framework for vehicle detection. For efficiency, (Chen et al., 2013) applies a parallel CNN architecture to extract features of RoIs and produce detection results. Variable sizes of convolutional filters and max-pooling fields are adopted to extract variable-scale features for vehicle detection in (Chen et al., 2014).

THE PROPOSED SYSTEM
In this section, we elaborate the details of our proposed framework. The whole system consists of two modules, the vehicle detection module (VDM) and the CycleGAN-based vehicle detection module (CVDM). VDM follows a region-based detection pipeline, whose architecture is shown in Figure 3. As shown in Figure 2, CVDM incorporates VDM into a CycleGAN-like architecture, which aims to address the detection problem in LR images; the architecture of its Detector is the same as VDM.

Vehicle detection module (VDM)
We model the vehicle detection problem with a region-based method, which forwards the entire image through a sequence of convolutional layers, extracts a set of feature maps corresponding to potential region proposals, and then produces detection results via two sibling branches. To train networks that complete these sub-tasks in an end-to-end fashion, our approach is composed of three main components. First, a basic convolutional network generates feature maps. Second, a hierarchical architecture constructs multilevel representations and predictions. Third, an online hard example mining technique digs out discriminative samples.
3.1.1 ConvNet and multilevel feature architecture Deep residual networks have proved effective for feature learning and have achieved remarkable success in the object detection task. Thus, we utilize ResNet50 for feature extraction. However, we observe that in remote sensing images, a vehicle may occupy a relatively small area of the whole image. The stride of ResNet50 is 32, which results in losing vehicle information during the convolution and pooling operations. A commonly used countermeasure is multiscale training and testing, which is obviously time consuming and cannot guarantee that the features are discriminative enough for final detection. To alleviate this problem, many feature fusion strategies have been proposed to build hierarchical architectures and make predictions at multiple feature levels. As FPN obtains state-of-the-art results on canonical benchmark datasets, we adopt an FPN-like architecture to construct appropriate multilevel features for our task.
ResNet50 has 5 blocks, namely c1, c2, c3, c4, c5, each consisting of several convolutional layers. To build semantic representations, the feature maps of each block are filtered, upsampled, and merged with those of the previous block. Finally, there are four stages in the proposed multilevel network, namely p2, p3, p4, p5. The original RPN generates region proposals on the last block of convolutional layers. FPN performs region proposal on each stage, plus an additional stage p6, to cover objects ranging from 32^2 to 512^2 pixels. Taking the size range of vehicle objects into account, we only utilize p2, p3, p4 for region proposal. Table 1 gives a detailed comparison between the naive RPN, FPN, and the proposed multilevel architecture. In the subsequent detection network, the original RPN projects RoIs onto c5, while FPN assigns each RoI to a stage based on its area. Because vehicle objects are generally small, we extract the feature maps of each proposal from the finest stage p2, which provides discriminative representations for classification and localization.
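The top-down fusion described above can be sketched as follows. The channel sizes follow standard ResNet50 block outputs (c2..c5) and the 256-d pyramid width is the usual FPN choice; the class and argument names are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelFeatures(nn.Module):
    """FPN-like fusion: 1x1 lateral convs, top-down upsample-and-add,
    then 3x3 smoothing convs producing p2..p5 (finest first)."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels])
        self.smooth = nn.ModuleList(
            [nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
             for _ in in_channels])

    def forward(self, c2, c3, c4, c5):
        # Reduce each block output to a common width.
        feats = [lat(c) for lat, c in zip(self.lateral, (c2, c3, c4, c5))]
        # Top-down pathway: upsample the coarser map and add it in.
        for i in range(len(feats) - 2, -1, -1):
            feats[i] = feats[i] + F.interpolate(
                feats[i + 1], size=feats[i].shape[-2:], mode="nearest")
        # 3x3 conv to reduce upsampling aliasing.
        return [sm(f) for sm, f in zip(self.smooth, feats)]
```

Per the text, only p2..p4 would feed the RPN, and RoI features would be pooled from the finest level p2.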
3.1.2 Sub-detection network The feature maps of each proposal are pooled to 7 × 7 bins, followed by two consecutive fully connected layers. The outputs are then forwarded to sibling branches for classification and localization. We apply the standard multi-task objective of Equation 1:

L({p_i}, {t_i}) = (1 / N_cls) Σ_i L_cls(p_i, l*_i) + λ (1 / N_reg) Σ_i l*_i L_reg(t_i, t*_i)   (1)

Here, i is the index of a region in a mini-batch and p_i is the predicted probability of region i being a vehicle. The ground-truth label l*_i is 1 if the region is positive, and 0 otherwise. t_i is a vector representing the 4 parameterized coordinates of the predicted bounding box, and t*_i is that of the ground-truth box associated with a positive anchor. L_cls is the multi-class cross-entropy loss, and the regression loss L_reg is activated only for positive regions. For a more detailed discussion of this objective function and the recommended parameter values, readers can refer to (Ren et al., 2017).
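A minimal sketch of this two-branch objective, assuming the Faster R-CNN conventions (cross entropy for classification, smooth-L1 box regression on positive regions only); the function name and balancing weight `lam` are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, labels, bbox_pred, bbox_targets, lam=1.0):
    """Equation 1 sketch: cross entropy over all regions plus smooth-L1
    regression restricted to positives (where the label l*_i is 1)."""
    cls_loss = F.cross_entropy(cls_logits, labels)
    pos = labels > 0                       # l*_i = 1 selects positive regions
    if pos.any():
        reg_loss = F.smooth_l1_loss(bbox_pred[pos], bbox_targets[pos])
    else:
        reg_loss = bbox_pred.sum() * 0.0   # no positives: zero, keeps graph
    return cls_loss + lam * reg_loss
```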
3.1.3 Online hard example mining Hard example mining aims to dig out samples that are not yet distinguishable by the classifier, making it progressively more discriminative. Especially in the remote sensing area, background information is usually complex, implying high similarity between positive and negative samples, so randomly selecting training samples would miss useful information. Consequently, processing samples from simple to complex has been proposed to solve this problem. Specifically, researchers use an alternating learning strategy, incorporating influential samples gradually and training the classifier over several rounds, with a selection criterion based on the confidence of the previous detection model. We call this mining approach offline. Later, in (Shrivastava et al., 2016a), researchers completed this task in an online manner and successfully embedded the algorithm into Faster R-CNN, namely online hard example mining. In this manner, during each training forward pass, the proposals with the highest loss values are selected as hard examples and back-propagated to update the weights. In implementation, the backbone ConvNet is followed by two sub-detection networks, called the read-only and standard modules. The former branch is responsible for calculating the loss values of proposals, and the latter accounts for the standard SGD update together with the backbone ConvNet. Readers can find detailed information on this algorithm in (Shrivastava et al., 2016a). We display some hard examples obtained in our implementation in Figure 4.
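The selection step of online hard example mining can be sketched as follows; the `batch_size` of 128 is a typical value, and the helper names in the usage comment are hypothetical, not the paper's implementation.

```python
import torch

def select_hard_examples(losses, batch_size=128):
    """The read-only branch scores every proposal; only the top-loss
    proposals are kept for the backward pass of the standard branch."""
    k = min(batch_size, losses.numel())
    _, hard_idx = torch.topk(losses.detach(), k)
    return hard_idx

# Pseudo-usage inside a training step (readonly_head / standard_head
# are hypothetical names for the two sub-detection networks):
#   with torch.no_grad():
#       per_roi_loss = readonly_head(features, rois)
#   idx = select_hard_examples(per_roi_loss)
#   loss = standard_head(features, rois[idx]).mean()
#   loss.backward()
```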

CycleGAN-based vehicle detection module (CVDM)
In this section, we focus on the detection task in LR images via simultaneous SR and object detection. A common solution to this problem is to directly upsample the image with a bicubic kernel, which loses appearance details. Thus, we exploit an SR method to enhance the LR image. Existing methods model this problem with a fully convolutional network (FCN), and pixel-level annotations, i.e., paired low-/high-resolution images, are essential for these models. However, in practice, it is difficult to obtain paired training data. To ease the burden of data collection, unsupervised learning regimes have been developed for domain translation. Our approach is inspired by two representatives, CycleGAN (Zhu et al., 2017) and Cycle-in-Cycle (Yuan et al., 2018), which realize unsupervised image translation with GANs. Our framework consists of generators and a discriminator, in which G_S super-resolves the LR image, G_L restores the obtained SR image back to the LR domain, and the Detector realizes vehicle detection. As shown in Figure 5, there are two generators in the image SR component, where I_LR and I'_LR represent the original LR image and its restored counterpart respectively, and I_SR is the corresponding super-resolved image. The cycle consistency loss is:

L_cyc = || G_L(G_S(I_LR)) - I_LR ||^2   (2)
To preserve the color and quality of the super-resolved image, we add an identity loss, which also uses the MSE loss, to train the whole model. As we cannot access paired images in our target remote sensing data, we refer to a dataset that is built for SR purposes and unrelated to our target data. T_HR denotes the high-resolution reference, while T_LR represents its LR counterpart, downsampled by a bicubic kernel. The identity loss is formulated as Equation 3:

L_idt = || G_S(T_LR) - T_HR ||^2   (3)
We utilize an adversarial loss for G_S and its discriminator D, which aims to distinguish the high-resolution image T_HR from the generated one I_SR. We present the objective as:

L_GAN = E[log D(T_HR)] + E[log(1 - D(G_S(I_LR)))]   (4)

The overall objective for the SR module is then:

L_SR = L_GAN + λ1 L_cyc + λ2 L_idt   (5)

where λ1 and λ2 control the importance of the consistency loss and identity loss in the whole model. The architectures of the two generators are shown in Table 2 and Table 3; discriminator D is displayed in Table 4.
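The generator-side SR objective (adversarial term plus MSE cycle-consistency and identity terms) can be sketched as below. The non-saturating BCE form of the adversarial term and the function signature are assumptions for illustration; `d_on_sr_logits` stands for the discriminator's raw output on the super-resolved image.

```python
import torch
import torch.nn.functional as F

def sr_generator_loss(d_on_sr_logits, i_lr, i_lr_restored, t_lr_sr, t_hr,
                      lam1=1.0, lam2=1.0):
    """Generator objective sketch: fool D on I_SR, keep G_L(G_S(I_LR))
    close to I_LR (cycle), and keep G_S(T_LR) close to T_HR (identity)."""
    # Adversarial term: push D's logits on generated images toward "real".
    adv = F.binary_cross_entropy_with_logits(
        d_on_sr_logits, torch.ones_like(d_on_sr_logits))
    cyc = F.mse_loss(i_lr_restored, i_lr)   # cycle-consistency loss
    idt = F.mse_loss(t_lr_sr, t_hr)         # identity loss
    return adv + lam1 * cyc + lam2 * idt
```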

Discriminator network Detector
We embed the proposed VDM as a discriminator in our CycleGAN-based framework, which takes the generated I_SR as input and outputs the vehicle detection result. Our CVDM is thus modeled in a multi-task learning fashion, including super-resolution and object detection. Taking the relationship between the two tasks into account, we back-propagate the detection loss to the SR network, which guides the generator to produce images that are beneficial for detection. In summary, the overall objective for CVDM is:

L_CVDM = L_SR + λ3 L_Det   (6)

where L_Det is the detection loss, the same as Equation 1, and λ3 is its loss weight.

Implementation details
We first train the VDM network, whose backbone is initialized with ResNet50 pretrained on the ImageNet classification task. The model is trained with the SGD optimizer for 60k iterations in total. The initial learning rate is set to 2.5e-3 and reduced to 2.5e-4 after 40k iterations. Next, we train the CycleGAN-based SR model, namely CycGANSR. G_S is initialized with the model released by (Lim et al., 2017); G_L and D are trained from scratch. All the networks except the Detector are trained with the Adam optimizer, with an initial learning rate of 1e-4 reduced to 1e-5 after 40k iterations. The batch size is 2 and the networks are trained for 80k iterations in total. As it is difficult to optimize the generators and the discriminator simultaneously, we adopt an alternating learning strategy. When training the generators, the parameters of the discriminator are fixed and the objective function is Eq (7), without the classification loss (4th term) and localization loss (5th term); here λ1 and λ2 are both set to 1. When training the discriminator, we fix the generators and the objective function is Eq (8), without the detector loss.
After training the CycGANSR network and the detection network, we train them jointly. The training procedure is the same as for CycGANSR, with Eq (7) and Eq (8) as the objective functions of the generator and discriminator steps respectively. λ3 and ω are set to 0.01 and 0.1 respectively. For VDM, the training image scale is 800 × 800. For CVDM, the input image is 200 × 200 and the upsampling factor is 4.

EXPERIMENTS
In this section, we first introduce the experimental setup, including data preparation and augmentation. Then we present the results and compare ours with other state-of-the-art methods.

Table 5. Results on VEDAI dataset. (* This column indicates the resolution of the input testing image. HF means the input includes its horizontal flip. FASR represents Faster R-CNN++. RVD, FVD, YVD, DVD represent the works (Zhong et al., 2017), (Carlet, Abayowa, 2017), (Lu et al., 2018), (Uus, Krilavičius, 2019) respectively.)

AGRC image collection, with 12.5cm GSD. We choose its half-resolution version of 512×512. Thus, the vehicles in this set are smaller than in the other datasets, and a car is typically about 10×8 pixels. 3) The DLR Munich dataset (Liu, Mattyus, 2015) is captured at about 1000m above the ground over the area of Munich, Germany, using the DLR 3K camera system. There are 20 images in total (of resolution 5616×3744 pixels), with approximately 13cm GSD. 4) The UCAS-AOD dataset (Zhu et al., 2015) consists of 510 satellite images of resolution 659×1280 pixels, including 410 training images and 100 testing ones. Due to the influence of environment and equipment, the vehicles in this dataset are usually larger than those of the Munich dataset.
However, the quality of this dataset is much poorer.
We apply average precision (AP) and mean recall rate (mRecall, the mean of the recalls at IoU thresholds from 0.5 to 0.95 with a stride of 0.05) to compare ours with other methods in the deep learning community. Furthermore, the F1 score, precision, and recall (at IoU 0.5) are adopted for comparison with methods in the remote sensing area. Their definitions are:

recall = TP / (TP + FN)   (9)

precision = TP / (TP + FP)   (10)

where TP, FP, FN represent true positives, false positives, and false negatives respectively, and the F1 score is the harmonic mean of precision and recall.
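The metrics above can be computed directly from the matched detection counts:

```python
def detection_metrics(tp, fp, fn):
    """Recall and precision as in Equations 9-10, plus the F1 score
    (harmonic mean of precision and recall, the standard definition)."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1
```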

Homography-based data augmentation
We define the ground as a plane, which is usually not perpendicular to the main optical axis of the camera. This deforms vehicle targets and makes the detection task more difficult. To alleviate this problem, we apply a homography transformation on the training data to simulate data captured from a more oblique viewpoint. We display some examples in Figure 6. Each image is rotated about the x and y axes, with rotation angles of −15° and 15° respectively. Given the rotation angle, the homography matrix can be estimated to calculate the coordinates of the transformed bounding box, as shown in Figure 6.

Table 6. Results on Potsdam dataset

Figure 7. Examples of remote sensing images from the VEDAI (1st row), Potsdam (2nd row), DLR Munich (3rd row), and UCAS-AOD (4th row) datasets. The vehicle detection results of our method are marked with green boxes; blue and red boxes indicate misses and false alarms respectively.

In the deep learning community, we compare with Faster R-CNN++ (obtained from Faster R-CNN by replacing the VGG16 ConvNet with ResNet50 and utilizing the FPN architecture for feature fusion) and YOLOv3, which are representatives of the two-stage and one-stage trends respectively. Note that we use this advanced version of Faster R-CNN for a fair comparison. In the remote sensing community, we directly report the published results on these datasets. We clarify that YOLOv3 is trained with multiple scales (including 600 × 600), while the other models are trained at 800 × 800.
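The homography for a simulated camera rotation can be sketched as follows. For a pure rotation R, image points map through H = K R K⁻¹; the focal length and principal point of the intrinsics K are assumptions (the paper does not give them), with f defaulting to the image width.

```python
import numpy as np

def rotation_homography(angle_deg, axis, w, h, f=None):
    """Homography simulating a camera rotation about the x or y axis,
    used to warp an image (and its boxes) to an oblique viewpoint."""
    a = np.deg2rad(angle_deg)
    c, s = np.cos(a), np.sin(a)
    if axis == "x":
        R = np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
    else:  # axis == "y"
        R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    f = f or float(w)                       # assumed focal length
    K = np.array([[f, 0, w / 2.0], [0, f, h / 2.0], [0, 0, 1.0]])
    return K @ R @ np.linalg.inv(K)

def warp_points(H, pts):
    """Apply H to Nx2 points, e.g. bounding-box corners."""
    p = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return p[:, :2] / p[:, 2:3]
```

The warped corners of each ground-truth box give the transformed bounding box, matching the augmentation described above.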
In Table 5 and the 1st row of Figure 7, we report and display the results on the VEDAI dataset. We first discuss the results inferred at the training scale of 800×800, where ours outperforms YOLOv3 and Faster R-CNN++ by 13 and 3 points in AP, respectively. At the higher IoU threshold of 0.75, the result of YOLOv3 drops sharply to 18.9%, compared with 40.3% for the proposed VDM and 37.2% for Faster R-CNN++, which implies that region-based methods are more robust for small vehicles. The mRecall values also support this conclusion. As augmented testing can boost overall performance, we apply horizontal flip augmentation on the testing data, and the results of both VDM and Faster R-CNN++ are improved. The latter, however, drops by more than 15 points on all metrics when testing at the 1200 × 1200 scale, whereas our results become even better than before. The mRecall of the last row exceeds that of Faster R-CNN++ by more than 20 points, which fully demonstrates the robustness and generality of our VDM.

Table 7. Recall, precision, and F1-score results on the Potsdam dataset, compared with (Audebert et al., 2017).
In Table 6, Table 7, and the 2nd row of Figure 7, we report and display the results on the Potsdam dataset. As the GSD of Potsdam is the smallest, its objects have the best appearance quality among the datasets, and all methods report better results. Nevertheless, apart from a small drop in AP at the 0.5 IoU threshold, our method achieves the best results on the other three metrics. Under augmented testing, the proposed VDM remains robust, whereas Faster R-CNN++ behaves badly.

Results of CVDM
We conduct experiments on the DLR Munich and UCAS-AOD datasets. To evaluate the proposed CVDM, we downsample the training images to 200 × 200 as the input of our model. Here, we also compare ours with R-FCN (Dai et al., 2016) and SSD (Liu et al., 2016a), which are both competitive detection methods. Tables 8 and 9 show the results on the DLR Munich and UCAS-AOD datasets. First, we test the detection performance of R-FCN, SSD, YOLOv3, and Faster R-CNN++ on the input LR images (without any SR operation); the results are poor, which demonstrates that low-quality images limit detection performance for both one-stage and two-stage methods. Next, we study the influence of different upsampling methods: bicubic interpolation, EDSR (pretrained model), CycleGAN-based SR (which does not incorporate the detector), and our CVDM. Detection results on these two datasets are shown in rows 3 and 4 of Figure 7.

Table 9. Results on UCAS-AOD dataset
We implement our experiments in PyTorch on an NVIDIA GeForce GTX 1080 Ti with 11 GB of on-board memory.

CONCLUSION
In this paper, we have investigated advanced deep learning techniques, including a better backbone ConvNet, multilevel feature fusion, and sample mining, to realize vehicle detection in remote sensing images. Homography-based data augmentation is proposed to address the multi-angle problem at the data collection stage. Furthermore, we leverage a CycleGAN-like architecture to realize simultaneous SR and object detection for LR images, where the SR task relies on an unsupervised learning regime and is guided by the detection task. Our experiments show that our system surpasses state-of-the-art methods. In the future, we plan to realize instance segmentation of vehicles in remote sensing images.